Commit Graph

98 Commits

Author SHA1 Message Date
Chris van Meer
c7c8bd4ce2 Updates to the UI block (#16328)
1. On the Consul address, following the recommendation for the HTTPS
   API on port 8501.
2. Add the hint to use HEX values for the colors.
2023-04-18 18:28:17 -07:00
Tim Gross
c3002db815 client: allow drain_on_shutdown configuration (#16827)
Adds a new configuration to clients to optionally allow them to drain their
workloads on shutdown. The client sends the `Node.UpdateDrain` RPC targeting
itself and then monitors the drain state as seen by the server until the drain
is complete or the deadline expires. If it loses connection with the server, it
will monitor local client status instead to ensure allocations are stopped
before exiting.
2023-04-14 15:35:32 -04:00
Seth Hoenig
2c44cbb001 api: enable support for setting original job source (#16763)
* api: enable support for setting original source alongside job

This PR adds support for setting job source material along with
the registration of a job.

This includes a new HTTP endpoint and a new RPC endpoint for
making queries for the original source of a job. The
HTTP endpoint is /v1/job/<id>/submission?version=<version> and
the RPC method is Job.GetJobSubmission.

The job source (if submitted, and doing so is always optional), is
stored in the job_submission memdb table, separately from the
actual job. This way we do not incur overhead of reading the large
string field throughout normal job operations.

The server config now includes job_max_source_size for configuring
the maximum size the job source may be, before the server simply
drops the source material. This should help prevent Bad Things from
happening when huge jobs are submitted. If the value is set to 0,
all job source material will be dropped.

* api: avoid writing var content to disk for parsing

* api: move submission validation into RPC layer

* api: return an error if updating a job submission without namespace or job id

* api: be exact about the job index we associate a submission with (modify)

* api: reword api docs scheduling

* api: prune all but the last 6 job submissions

* api: protect against nil job submission in job validation

* api: set max job source size in test server

* api: fixups from pr
2023-04-11 08:45:08 -05:00
Tim Gross
be0678bd84 docs: fix template retry attempts default documentation (#16667)
The configuration docs for `client.template.vault_retry`, `consul_retry`, and
`nomad_retry` incorrectly document the default number of attempts to be
unlimited (0). When we added these config blocks, we defaulted the fields to
`nil` for backwards compatibility, which causes them to fall back to the default
consul-template configuration values.
2023-03-28 08:27:06 -04:00
James Rasell
36825576b4 docs: fix-up legacy link in client config page. (#16678) 2023-03-28 09:32:34 +01:00
Tobias Birkefeld
39e7a7e01f docs: fix link of Read Stats API (#16673)
The former link results in a 404. Update the link to the correct developer docs.
2023-03-28 08:49:44 +01:00
Michael Schurter
282e3bcfcc Enable ACLs on E2E test clients (#16530)
* e2e: uniformly enable acls across all agents

* docs: clarify that acls should be set everywhere
2023-03-16 14:22:41 -07:00
Alessio Perugini
365ccf4377 Allow configurable range of Job priorities (#16084) 2023-02-17 09:23:13 -05:00
visweshs123
7d4ccf11bc csi: add option to configure CSIVolumeClaimGCInterval (#16195) 2023-02-16 10:41:15 -05:00
Seth Hoenig
511d0c1e70 artifact: protect against unbounded artifact decompression (1.5.0) (#16151)
* artifact: protect against unbounded artifact decompression

Starting with 1.5.0, set defaut values for artifact decompression limits.

artifact.decompression_size_limit (default "100GB") - the maximum amount of
data that will be decompressed before triggering an error and cancelling
the operation

artifact.decompression_file_count_limit (default 4096) - the maximum number
of files that will be decompressed before triggering an error and
cancelling the operation.

* artifact: assert limits cannot be nil in validation
2023-02-14 09:28:39 -06:00
Tim Gross
4fd3c17a79 docs: update example license_path (#16082)
In #13374 we updated the commented-out `license_path` in the packaged example
configuration file to match the existing documentation. Although this config
value was commented-out, it was reported that changing the value was
confusing. Update the commented-out line to the previous value and update the
documented examples to match that. This matches most of the examples for
Consul/Vault licensing as well. I've double-checked the tutorials and it looks
like it'd been left on the previous value there, so no additional work to be
done.
2023-02-07 16:28:51 -05:00
Tim Gross
6145cdcd11 cli: remove deprecated keyring and keygen commands (#16068)
These command were marked as deprecated in 1.4.0 with intent to remove in
1.5.0. Remove them and clean up the docs.
2023-02-07 09:49:52 -05:00
Phil Renaud
d71fc9565c Label for the Web UI (#16006)
* Demoable state

* Demo mirage color

* Label as a block with foreground and background colours

* Test mock updates

* Go test updated

* Documentation update for label support
2023-02-02 16:29:04 -05:00
Charlie Voiselle
55df5af4aa client: Add option to enable hairpinMode on Nomad bridge (#15961)
* Add `bridge_network_hairpin_mode` client config setting
* Add node attribute: `nomad.bridge.hairpin_mode`
* Changed format string to use `%q` to escape user provided data
* Add test to validate template JSON for developer safety

Co-authored-by: Daniel Bennett <dbennett@hashicorp.com>
2023-02-02 10:12:15 -05:00
stswidwinski
2285432424 GC: ensure no leakage of evaluations for batch jobs. (#15097)
Prior to 2409f72 the code compared the modification index of a job to itself. Afterwards, the code compared the creation index of the job to itself. In either case there should never be a case of re-parenting of allocs causing the evaluation to trivially always result in false, which leads to unreclaimable memory.

Prior to this change allocations and evaluations for batch jobs were never garbage collected until the batch job was explicitly stopped. The new `batch_eval_gc_threshold` server configuration controls how often they are collected. The default threshold is `24h`.
2023-01-31 13:32:14 -05:00
Piotr Kazmierczak
949a6f60c7 renamed stanza to block for consistency with other projects (#15941) 2023-01-30 15:48:43 +01:00
Ashlee M Boyer
3444ece549 docs: Migrate link formats (#15779)
* Adding check-legacy-links-format workflow

* Adding test-link-rewrites workflow

* chore: updates link checker workflow hash

* Migrating links to new format

Co-authored-by: Kendall Strautman <kendallstrautman@gmail.com>
2023-01-25 09:31:14 -08:00
Karl Johann Schubert
588392cabc client: add disk_total_mb and disk_free_mb config options (#15852) 2023-01-24 09:14:22 -05:00
Charlie Voiselle
85f67d4a83 Add raft snapshot configuration options (#15522)
* Add config elements
* Wire in snapshot configuration to raft
* Add hot reload of raft config
* Add documentation for new raft settings
* Add changelog
2023-01-20 14:21:51 -05:00
Seth Hoenig
c3017da6af consul: add client configuration for grpc_ca_file (#15701)
* [no ci] first pass at plumbing grpc_ca_file

* consul: add support for grpc_ca_file for tls grpc connections in consul 1.14+

This PR adds client config to Nomad for specifying consul.grpc_ca_file

These changes combined with https://github.com/hashicorp/consul/pull/15913 should
finally enable Nomad users to upgrade to Consul 1.14+ and use tls grpc connections.

* consul: add cl entgry for grpc_ca_file

* docs: mention grpc_tls changes due to Consul 1.14
2023-01-11 09:34:28 -06:00
Luiz Aoqui
b72c79ebb9 docs: networking (#15358)
Co-authored-by: Charlie Voiselle <464492+angrycub@users.noreply.github.com>
2023-01-06 11:47:10 -05:00
Dao Thanh Tung
f89ac80801 agent: Make agent syslog log level inherit from Nomad agent log (#15625) 2023-01-04 09:38:06 -05:00
Seth Hoenig
493389e861 artifact: enable inheriting environment variables from client (#15514)
* artifact: enable inheriting environment variables from client

This PR adds client configuration for specifying environment variables that
should be inherited by the artifact sandbox process from the Nomad Client agent.

Most users should not need to set these values but the configuration is provided
to ensure backwards compatability. Configuration of go-getter should ideally be
done through the artifact block in a jobspec task.

e.g.

```hcl
client {
  artifact {
    set_environment_variables = "TMPDIR,GIT_SSH_OPTS"
  }
}
```

Closes #15498

* website: update set_environment_variables text to mention PATH
2022-12-09 15:46:07 -06:00
Seth Hoenig
990537e8ba artifact: add client toggle to disable filesystem isolation (#15503)
This PR adds the client config option for turning off filesystem isolation,
applicable on Linux systems where filesystem isolation is possible and
enabled by default.

```hcl
client{
  artifact {
    disable_filesystem_isolation = <bool:false>
  }
}
```

Closes #15496
2022-12-08 12:29:23 -06:00
James Rasell
847c2cc528 client: accommodate Consul 1.14.0 gRPC and agent self changes. (#15309)
* client: accommodate Consul 1.14.0 gRPC and agent self changes.

Consul 1.14.0 changed the way in which gRPC listeners are
configured, particularly when using TLS. Prior to the change, a
single listener was responsible for handling plain-text and
encrypted gRPC requests. In 1.14.0 and beyond, separate listeners
will be used for each, defaulting to 8502 and 8503 for plain-text
and TLS respectively.

The change means that Nomad’s Consul Connect integration would not
work when integrated with Consul clusters using TLS and running
1.14.0 or greater.

The Nomad Consul fingerprinter identifies the gRPC port Consul has
exposed using the "DebugConfig.GRPCPort" value from Consul’s
“/v1/agent/self” endpoint. In Consul 1.14.0 and greater, this only
represents the plain-text gRPC port which is likely to be disbaled
in clusters running TLS. In order to fix this issue, Nomad now
takes into account the Consul version and configured scheme to
optionally use “DebugConfig.GRPCTLSPort” value from Consul’s agent
self return.

The “consul_grcp_socket” allocrunner hook has also been updated so
that the fingerprinted gRPC port attribute is passed in. This
provides a better fallback method, when the operator does not
configure the “consul.grpc_address” option.

* docs: modify Consul Connect entries to detail 1.14.0 changes.

* changelog: add entry for #15309

* fixup: tidy tests and clean version match from review feedback.

* fixup: use strings tolower func.
2022-11-21 09:19:09 -06:00
Tim Gross
6b2da83f6a keyring: safely handle missing keys and restore GC (#15092)
When replication of a single key fails, the replication loop breaks early and
therefore keys that fall later in the sorting order will never get
replicated. This is particularly a problem for clusters impacted by the bug that
caused #14981 and that were later upgraded; the keys that were never replicated
can now never be replicated, and so we need to handle them safely.

Included in the replication fix:
* Refactor the replication loop so that each key replicated in a function call
  that returns an error, to make the workflow more clear and reduce nesting. Log
  the error and continue.
* Improve stability of keyring replication tests. We no longer block leadership
  on initializing the keyring, so there's a race condition in the keyring tests
  where we can test for the existence of the root key before the keyring has
  been initialize. Change this to an "eventually" test.

But these fixes aren't enough to fix #14981 because they'll end up seeing an
error once a second complaining about the missing key, so we also need to fix
keyring GC so the keys can be removed from the state store. Now we'll store the
key ID used to sign a workload identity in the Allocation, and we'll index the
Allocation table on that so we can track whether any live Allocation was signed
with a particular key ID.
2022-11-01 15:00:50 -04:00
Tim Gross
b583f7822a keyring: remove root key GC (#15034) 2022-10-25 17:06:18 -04:00
James Rasell
eaea9164a5 acl: correctly resolve ACL roles within client cache. (#14922)
The client ACL cache was not accounting for tokens which included
ACL role links. This change modifies the behaviour to resolve role
links to policies. It will also now store ACL roles within the
cache for quick lookup. The cache TTL is configurable in the same
manner as policies or tokens.

Another small fix is included that takes into account the ACL
token expiry time. This was not included, which meant tokens with
expiry could be used past the expiry time, until they were GC'd.
2022-10-20 09:37:32 +02:00
Anthony
6dcf008fbb Updated datacenter block description (#14953)
* Updated datacenter block description

* Replacing accidentally removed title

* docs: add closing period

Co-authored-by: Seth Hoenig <shoenig@duck.com>
2022-10-19 08:44:52 -05:00
Bryce Kalow
f49b3a95dd website: fixes redirected links (#14918) 2022-10-18 10:31:52 -05:00
Kevin Wang
57dc7c2ab1 fix: website broken links (#14904)
* fix: website broken links

* fix up keyring-rotate link

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2022-10-17 11:32:10 -04:00
Tim Gross
fb1f5ea2d9 Revert removing deprecated client options docs (#14753)
This reverts PR #12416 and commit 6668ce022a.

While the driver options are well and truly deprecated, this documentation also
covers features like `fingerprint.denylist` that are not available any other
way. Let's revert this until #12420 is ready.
2022-09-30 08:38:03 -04:00
Michael Schurter
a6dc5ea585 docs: write a lot of words about heartbeats (#14679)
* docs: write a lot of words about heartbeats

Alternative to #14670

* Apply suggestions from code review

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* use descriptive title for link

* rework example of high failover ttl

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2022-09-26 14:43:34 -07:00
Bryce Kalow
67d39725b1 website: content updates for developer (#14473)
Co-authored-by: Geoffrey Grosenbach <26+topfunky@users.noreply.github.com>
Co-authored-by: Anthony <russo555@gmail.com>
Co-authored-by: Ashlee Boyer <ashlee.boyer@hashicorp.com>
Co-authored-by: Ashlee M Boyer <43934258+ashleemboyer@users.noreply.github.com>
Co-authored-by: HashiBot <62622282+hashibot-web@users.noreply.github.com>
Co-authored-by: Kevin Wang <kwangsan@gmail.com>
2022-09-16 10:38:39 -05:00
Luiz Aoqui
4f9beb0b39 docs: add warning about changing region config (#14443) 2022-09-01 16:47:06 -04:00
James Rasell
20f71cf3e4 docs: add documentation for ACL token expiration and ACL roles. (#14332)
The ACL command docs are now found within a sub-dir like the
operator command docs. Updates to the ACL token commands to
accommodate token expiry have also been added.

The ACL API docs are now found within a sub-dir like the operator
API docs. The ACL docs now include the ACL roles endpoint as well
as updated ACL token endpoints for token expiration.

The configuration section is also updated to accommodate the new
ACL and server parameters for the new ACL features.
2022-08-31 16:13:47 +02:00
Derek Strickland
696deb9600 Add Nomad RetryConfig to agent template config (#13907)
* add Nomad RetryConfig to agent template config
2022-08-03 16:56:30 -04:00
Tim Gross
bf6116f5dd docs: document secure variables server config options (#13695) 2022-07-20 14:13:39 -04:00
Luiz Aoqui
d456cc1e7f Track plan rejection history and automatically mark clients as ineligible (#13421)
Plan rejections occur when the scheduler work and the leader plan
applier disagree on the feasibility of a plan. This may happen for valid
reasons: since Nomad does parallel scheduling, it is expected that
different workers will have a different state when computing placements.

As the final plan reaches the leader plan applier, it may no longer be
valid due to a concurrent scheduling taking up intended resources. In
these situations the plan applier will notify the worker that the plan
was rejected and that they should refresh their state before trying
again.

In some rare and unexpected circumstances it has been observed that
workers will repeatedly submit the same plan, even if they are always
rejected.

While the root cause is still unknown this mitigation has been put in
place. The plan applier will now track the history of plan rejections
per client and include in the plan result a list of node IDs that should
be set as ineligible if the number of rejections in a given time window
crosses a certain threshold. The window size and threshold value can be
adjusted in the server configuration.

To avoid marking several nodes as ineligible at one, the operation is rate
limited to 5 nodes every 30min, with an initial burst of 10 operations.
2022-07-12 18:40:20 -04:00
Michael Schurter
f998a2b77b core: merge reserved_ports into host_networks (#13651)
Fixes #13505

This fixes #13505 by treating reserved_ports like we treat a lot of jobspec settings: merging settings from more global stanzas (client.reserved.reserved_ports) "down" into more specific stanzas (client.host_networks[].reserved_ports).

As discussed in #13505 there are other options, and since it's totally broken right now we have some flexibility:

Treat overlapping reserved_ports on addresses as invalid and refuse to start agents. However, I'm not sure there's a cohesive model we want to publish right now since so much 0.9-0.12 compat code still exists! We would have to explain to folks that if their -network-interface and host_network addresses overlapped, they could only specify reserved_ports in one place or the other?! It gets ugly.
Use the global client.reserved.reserved_ports value as the default and treat host_network[].reserverd_ports as overrides. My first suggestion in the issue, but @groggemans made me realize the addresses on the agent's interface (as configured by -network-interface) may overlap with host_networks, so you'd need to remove the global reserved_ports from addresses shared with a shared network?! This seemed really confusing and subtle for users to me.
So I think "merging down" creates the most expressive yet understandable approach. I've played around with it a bit, and it doesn't seem too surprising. The only frustrating part is how difficult it is to observe the available addresses and ports on a node! However that's a job for another PR.
2022-07-12 14:40:25 -07:00
James Rasell
24220d0a02 core: allow pausing and un-pausing of leader broker routine (#13045)
* core: allow pause/un-pause of eval broker on region leader.

* agent: add ability to pause eval broker via scheduler config.

* cli: add operator scheduler commands to interact with config.

* api: add ability to pause eval broker via scheduler config

* e2e: add operator scheduler test for eval broker pause.

* docs: include new opertor scheduler CLI and pause eval API info.
2022-07-06 16:13:48 +02:00
Seth Hoenig
f1cafd0789 core: remove support for raft protocol version 2
This PR checks server config for raft_protocol, which must now
be set to 3 or unset (0). When unset, version 3 is used as the
default.
2022-06-23 14:37:50 +00:00
Nick Wales
a8dca34a3a Updates TLS documentation 2022-06-16 12:15:40 -05:00
Derek Strickland
dd71afb891 template: improve default language for max_stale and wait (#13334)
* template: improve default language for max_stale and wait

Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
2022-06-10 14:34:25 -04:00
Derek Strickland
7899fd3fac consul-template: Add fault tolerant defaults (#13041)
consul-template: Add fault tolerant defaults

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2022-06-08 14:08:25 -04:00
Michael Schurter
3968509886 artifact: fix numerous go-getter security issues
Fix numerous go-getter security issues:

- Add timeouts to http, git, and hg operations to prevent DoS
- Add size limit to http to prevent resource exhaustion
- Disable following symlinks in both artifacts and `job run`
- Stop performing initial HEAD request to avoid file corruption on
  retries and DoS opportunities.

**Approach**

Since Nomad has no ability to differentiate a DoS-via-large-artifact vs
a legitimate workload, all of the new limits are configurable at the
client agent level.

The max size of HTTP downloads is also exposed as a node attribute so
that if some workloads have large artifacts they can specify a high
limit in their jobspecs.

In the future all of this plumbing could be extended to enable/disable
specific getters or artifact downloading entirely on a per-node basis.
2022-05-24 16:29:39 -04:00
Luiz Aoqui
0abe5a6c79 vault: revert support for entity aliases (#12723)
After a more detailed analysis of this feature, the approach taken in
PR #12449 was found to be not ideal due to poor UX (users are
responsible for setting the entity alias they would like to use) and
issues around jobs potentially masquerading itself as another Vault
entity.
2022-04-22 10:46:34 -04:00
Luiz Aoqui
d412f7b497 Support Vault entity aliases (#12449)
Move some common Vault API data struct decoding out of the Vault client
so it can be reused in other situations.

Make Vault job validation its own function so it's easier to expand it.

Rename the `Job.VaultPolicies` method to just `Job.Vault` since it
returns the full Vault block, not just their policies.

Set `ChangeMode` on `Vault.Canonicalize`.

Add some missing tests.

Allows specifying an entity alias that will be used by Nomad when
deriving the task Vault token.

An entity alias assigns an indentity to a token, allowing better control
and management of Vault clients since all tokens with the same indentity
alias will now be considered the same client. This helps track Nomad
activity in Vault's audit logs and better control over Vault billing.

Add support for a new Nomad server configuration to define a default
entity alias to be used when deriving Vault tokens. This default value
will be used if the task doesn't have an entity alias defined.
2022-04-05 14:18:10 -04:00
Tim Gross
6668ce022a docs: remove deprecated client options parameters docs (#12416)
The client configuration options for drivers have been deprecated
since 0.9. We haven't torn them out completely but because they're
deprecated it's been hard to guarantee correct behavior. Remove the
documentation so that users aren't misled about their viability.
2022-03-31 11:45:51 -04:00
Michael Schurter
3ca38ee4ed template: fix comments and docs
Review notes from @lgfa29

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2022-03-29 09:25:23 -07:00