Commit Graph

2003 Commits

Author SHA1 Message Date
Luiz Aoqui
939d643fec Post 1.3.3 release (#14064)
* Generate files for 1.3.3 release

* Prepare for next release

* Merge release 1.3.3 files

Co-authored-by: hc-github-team-nomad-core <github-team-nomad-core@hashicorp.com>
2022-08-09 17:27:29 -04:00
Derek Strickland
696deb9600 Add Nomad RetryConfig to agent template config (#13907)
* add Nomad RetryConfig to agent template config
2022-08-03 16:56:30 -04:00
Piotr Kazmierczak
2e0b875b14 client: enable specifying user/group permissions in the template stanza (#13755)
* Adds Uid/Gid parameters to template.

* Updated diff_test

* fixed order

* update jobspec and api

* removed obsolete code

* helper functions for jobspec parse test

* updated documentation

* adjusted API jobs test.

* propagate uid/gid setting to job_endpoint

* adjusted job_endpoint tests

* making uid/gid into pointers

* refactor

* updated documentation

* updated documentation

* Update client/allocrunner/taskrunner/template/template_test.go

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>

* Update website/content/api-docs/json-jobs.mdx

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>

* propagating documentation change from Luiz

* formatting

* changelog entry

* changed changelog entry

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2022-08-02 22:15:38 +02:00
Eric Weber
07bbf1f91e Add stage_publish_base_dir field to csi_plugin stanza of a job (#13919)
* Allow specification of CSI staging and publishing directory path
* Add website documentation for stage_publish_dir
* Replace erroneous reference to csi_plugin.mount_config with csi_plugin.mount_dir
* Avoid requiring CSI plugins to be redeployed after introducing StagePublishDir
2022-08-02 09:42:44 -04:00
Tim Gross
82861ae8d7 secure vars: enforce ENT quotas (OSS work) (#13951)
Move the secure variables quota enforcement calls into the state store to ensure
quota checks are atomic with quota updates (in the same transaction).

Switch to a machine-size int instead of a uint64 for quota tracking. The
ENT-side quota spec is described as int, and negative values have a meaning as
"not permitted at all". Using the same type for tracking will make it easier to
the math around checks, and uint64 is infeasibly large anyways.

Add secure vars to quota HTTP API and CLI outputs and API docs.
2022-08-02 09:32:09 -04:00
Tim Gross
e4cceab4f0 fix flaky TestAgent_ProxyRPC_Dev test (#13925)
This test is a fairly trivial test of the agent RPC, but the test setup waits
for a short fixed window after the node starts to send the RPC. After looking at
detailed logs for recent test failures, it looks like the node registration for
the first node doesn't get a chance to happen before we make the RPC call. Use
`WaitForResultUntil` to give the test more time to run in slower test
environments, while allowing it to finish quickly if possible.
2022-07-28 14:47:15 -04:00
Lars Lehtonen
5d8258ecab testing: fix dropped test errors in command/agent (#13926) 2022-07-28 11:04:31 -04:00
Seth Hoenig
61e885dfb3 cleanup: use constants for on_update values 2022-07-21 13:09:47 -05:00
Seth Hoenig
4508af8160 Merge pull request #13715 from hashicorp/dev-nsd-checks
client: add support for checks in nomad services
2022-07-21 10:22:57 -05:00
Seth Hoenig
9f37b84db4 Merge pull request #13870 from hashicorp/exp-fp-optimization
client: use test timeouts for network fingerprinters in dev mode
2022-07-21 08:18:02 -05:00
Tim Gross
33f4f50044 search: use secure vars ACL policy for secure vars context (#13788)
The search RPC used a placeholder policy for searching within the secure
variables context. Now that we have ACL policies built for secure variables, we
can use them for search. Requires a new loose policy for checking if a token has
any secure variables access within a namespace, so that we can filter on
specific paths in the iterator.
2022-07-21 08:39:36 -04:00
Seth Hoenig
74bc3dd120 devmode: use minimal network timeouts for network fingerprinters in dev mode 2022-07-20 15:13:14 -05:00
Will Jordan
662a12a41e Return 429 response on HTTP max connection limit (#13621)
Return 429 response on HTTP max connection limit. Instead of silently closing
the connection, return a `429 Too Many Requests` HTTP response with a helpful
error message to aid debugging when the connection limit is unintentionally
reached.

Set a 10-millisecond write timeout and rate limiter for connection-limit 429
response to prevent writing the HTTP response from consuming too many server
resources.

Add `nomad.agent.http.exceeded metric` counting the number of HTTP connections
exceeding concurrency limit.
2022-07-20 14:12:21 -04:00
hc-github-team-nomad-core
5f8889d522 Generate files for 1.3.2 release 2022-07-13 19:33:41 -04:00
Michael Schurter
d857be3c45 http: only log alloc/exec errors when non-nil (#13730) 2022-07-13 09:44:51 -07:00
Luiz Aoqui
d456cc1e7f Track plan rejection history and automatically mark clients as ineligible (#13421)
Plan rejections occur when the scheduler work and the leader plan
applier disagree on the feasibility of a plan. This may happen for valid
reasons: since Nomad does parallel scheduling, it is expected that
different workers will have a different state when computing placements.

As the final plan reaches the leader plan applier, it may no longer be
valid due to a concurrent scheduling taking up intended resources. In
these situations the plan applier will notify the worker that the plan
was rejected and that they should refresh their state before trying
again.

In some rare and unexpected circumstances it has been observed that
workers will repeatedly submit the same plan, even if they are always
rejected.

While the root cause is still unknown this mitigation has been put in
place. The plan applier will now track the history of plan rejections
per client and include in the plan result a list of node IDs that should
be set as ineligible if the number of rejections in a given time window
crosses a certain threshold. The window size and threshold value can be
adjusted in the server configuration.

To avoid marking several nodes as ineligible at one, the operation is rate
limited to 5 nodes every 30min, with an initial burst of 10 operations.
2022-07-12 18:40:20 -04:00
Seth Hoenig
b2861f2a9b client: add support for checks in nomad services
This PR adds support for specifying checks in services registered to
the built-in nomad service provider.

Currently only HTTP and TCP checks are supported, though more types
could be added later.
2022-07-12 17:09:50 -05:00
Michael Schurter
f998a2b77b core: merge reserved_ports into host_networks (#13651)
Fixes #13505

This fixes #13505 by treating reserved_ports like we treat a lot of jobspec settings: merging settings from more global stanzas (client.reserved.reserved_ports) "down" into more specific stanzas (client.host_networks[].reserved_ports).

As discussed in #13505 there are other options, and since it's totally broken right now we have some flexibility:

Treat overlapping reserved_ports on addresses as invalid and refuse to start agents. However, I'm not sure there's a cohesive model we want to publish right now since so much 0.9-0.12 compat code still exists! We would have to explain to folks that if their -network-interface and host_network addresses overlapped, they could only specify reserved_ports in one place or the other?! It gets ugly.
Use the global client.reserved.reserved_ports value as the default and treat host_network[].reserverd_ports as overrides. My first suggestion in the issue, but @groggemans made me realize the addresses on the agent's interface (as configured by -network-interface) may overlap with host_networks, so you'd need to remove the global reserved_ports from addresses shared with a shared network?! This seemed really confusing and subtle for users to me.
So I think "merging down" creates the most expressive yet understandable approach. I've played around with it a bit, and it doesn't seem too surprising. The only frustrating part is how difficult it is to observe the available addresses and ports on a node! However that's a job for another PR.
2022-07-12 14:40:25 -07:00
Tim Gross
8627000301 core job for secure variables re-key (#13440)
When the `Full` flag is passed for key rotation, we kick off a core
job to decrypt and re-encrypt all the secure variables so that they
use the new key.
2022-07-11 13:34:06 -04:00
Charlie Voiselle
ee38ee03aa SV: CAS: Implement Check and Set for Delete and Upsert (#13429)
* SV: CAS
    * Implement Check and Set for Delete and Upsert
    * Reading the conflict from the state store
    * Update endpoint for new error text
    * Updated HTTP api tests
    * Conflicts to the HTTP api

* SV: structs: Update SV time to UnixNanos
    * update mock to UnixNano; refactor

* SV: encrypter: quote KeyID in error
* SV: mock: add mock for namespace w/ SV
2022-07-11 13:34:06 -04:00
Tim Gross
d03fd4b8b1 secure variable server configuration (#13307)
Add fields for configuring root key garbage collection and automatic
rotation. Fix the keystore path so that we write to a tempdir when in
dev mode.
2022-07-11 13:34:06 -04:00
Charlie Voiselle
39dcef8471 Implement HTTP search API for Variables (#13257)
* Add Path only index for SecureVariables
* Add GetSecureVariablesByPrefix; refactor tests
* Add search for SecureVariables
* Add prefix search for secure variables
2022-07-11 13:34:05 -04:00
Charlie Voiselle
75495f420c Secure Variables: Seperate Encrypted and Decrypted structs (#13355)
This PR splits SecureVariable into SecureVariableDecrypted and
SecureVariableEncrypted in order to use the type system to help
verify that cleartext secret material is not committed to file.

* Make Encrypt function return KeyID
* Split SecureVariable

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2022-07-11 13:34:05 -04:00
Charlie Voiselle
a7522d5c48 Don't write a SecureVariable with no Items (#13258) 2022-07-11 13:34:04 -04:00
Tim Gross
b69d1bffa8 remove end-user algorithm selection (#13190)
After internal design review, we decided to remove exposing algorithm
choice to the end-user for the initial release. We'll solve nonce
rotation by forcing rotations automatically on key GC (in a core job,
not included in this changeset). Default to AES-256 GCM for the
following criteria:

* faster implementation when hardware acceleration is available
* FIPS compliant
* implementation in pure go
* post-quantum resistance

Also fixed a bug in the decoding from keystore and switched to a 
harder-to-misuse encoding method.
2022-07-11 13:34:04 -04:00
Tim Gross
4c73f98423 bootstrap keyring (#13124)
When a server becomes leader, it will check if there are any keys in
the state store, and create one if there is not. The key metadata will
be replicated via raft to all followers, who will then get the key
material via key replication (not implemented in this changeset).
2022-07-11 13:34:04 -04:00
Charlie Voiselle
ba74aadb90 Secure Variables: Variables - State store, FSM, RPC (#13098)
* Secure Variables: State Store
* Secure Variables: FSM
* Secure Variables: RPC
* Secure Variables: HTTP API

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2022-07-11 13:34:04 -04:00
Tim Gross
ce8e7f1788 keystore serialization (#13106)
This changeset implements the keystore serialization/deserialization:

* Adds a JSON serialization extension for the `RootKey` struct, along with a metadata stub. When we serialize RootKey to the on-disk keystore, we want to base64 encode the key material but also exclude any frequently-changing fields which are stored in raft.
* Implements methods for loading/saving keys to the keystore.
* Implements methods for restoring the whole keystore from disk.
* Wires it all up with the `Keyring` RPC handlers and fixes up any fallout on tests.
2022-07-11 13:34:04 -04:00
Tim Gross
0b0aa3efe8 keyring HTTP API (#13077) 2022-07-11 13:34:04 -04:00
Charlie Voiselle
15d6dde25c Provide mock secure variables implementation (#12980)
* Add SecureVariable mock
* Add SecureVariableStub
* Add SecureVariable Copy and Stub funcs
2022-07-11 13:34:03 -04:00
James Rasell
d442e1b4c1 agent: test full object when performing test config parse. (#13668) 2022-07-11 16:21:36 +02:00
James Rasell
11cb4c6d82 core: allow deleting of evaluations (#13492)
* core: add eval delete RPC and core functionality.

* agent: add eval delete HTTP endpoint.

* api: add eval delete API functionality.

* cli: add eval delete command.

* docs: add eval delete website documentation.
2022-07-06 16:30:11 +02:00
James Rasell
24220d0a02 core: allow pausing and un-pausing of leader broker routine (#13045)
* core: allow pause/un-pause of eval broker on region leader.

* agent: add ability to pause eval broker via scheduler config.

* cli: add operator scheduler commands to interact with config.

* api: add ability to pause eval broker via scheduler config

* e2e: add operator scheduler test for eval broker pause.

* docs: include new opertor scheduler CLI and pause eval API info.
2022-07-06 16:13:48 +02:00
Seth Hoenig
bdead31863 api: enable selecting subset of services using rendezvous hashing
This PR adds the 'choose' query parameter to the '/v1/service/<service>' endpoint.

The value of 'choose' is in the form '<number>|<key>', number is the number
of desired services and key is a value unique but consistent to the requester
(e.g. allocID).

Folks aren't really expected to use this API directly, but rather through consul-template
which will soon be getting a new helper function making use of this query parameter.

Example,

curl 'localhost:4646/v1/service/redis?choose=2|abc123'

Note: consul-templte v0.29.1 includes the necessary nomadServices functionality.
2022-06-25 10:37:37 -05:00
Seth Hoenig
f1cafd0789 core: remove support for raft protocol version 2
This PR checks server config for raft_protocol, which must now
be set to 3 or unset (0). When unset, version 3 is used as the
default.
2022-06-23 14:37:50 +00:00
Grant Griffiths
2986f1f18a CSI: make plugin health_timeout configurable in csi_plugin stanza (#13340)
Signed-off-by: Grant Griffiths <ggriffiths@purestorage.com>
2022-06-14 10:04:16 -04:00
Kevin Schoonover
d725acb380 parse ACL token from authorization header (#12534) 2022-06-06 15:51:02 -04:00
Karan Sharma
2d95f06026 feat: Warn if bootstrap_expect is even number (#12961) 2022-06-06 15:22:59 +02:00
Lance Haig
eafc93902b Allow Operator Generated bootstrap token (#12520) 2022-06-03 07:37:24 -04:00
Huan Wang
b6e07487c2 adding support for customized ingress tls (#13184) 2022-06-02 18:43:58 -04:00
Seth Hoenig
69bbaa44f9 docs: add docs and tests for tagged_addresses 2022-05-31 13:02:48 -05:00
Jorge Marey
e2f954e848 Allow setting tagged addresses on services 2022-05-31 10:06:55 -05:00
Seth Hoenig
616988c6fb connect: enable setting connect upstream destination namespace 2022-05-26 09:39:36 -05:00
hc-github-team-nomad-core
abb5b572b0 Generate files for 1.3.1 release 2022-05-24 16:29:46 -04:00
Michael Schurter
3968509886 artifact: fix numerous go-getter security issues
Fix numerous go-getter security issues:

- Add timeouts to http, git, and hg operations to prevent DoS
- Add size limit to http to prevent resource exhaustion
- Disable following symlinks in both artifacts and `job run`
- Stop performing initial HEAD request to avoid file corruption on
  retries and DoS opportunities.

**Approach**

Since Nomad has no ability to differentiate a DoS-via-large-artifact vs
a legitimate workload, all of the new limits are configurable at the
client agent level.

The max size of HTTP downloads is also exposed as a node attribute so
that if some workloads have large artifacts they can specify a high
limit in their jobspecs.

In the future all of this plumbing could be extended to enable/disable
specific getters or artifact downloading entirely on a per-node basis.
2022-05-24 16:29:39 -04:00
Will Jordan
304d0cf595 Don't buffer json logs on agent startup (#13076)
There's no reason to buffer json logs on agent startup
since logs in this format already aren't reordered.
2022-05-19 15:40:30 -04:00
hc-github-team-nomad-core
ddd77b4287 Generate files for 1.3.0 release 2022-05-13 17:32:20 -04:00
hc-github-team-nomad-core
5498afd86c Generate files for 1.3.0-rc.1 release 2022-05-13 17:31:57 -04:00
James Rasell
bab219a8ba agent: fix panic when logging about protocol version config use. (#12962)
The log line comes before the agent logger has been setup,
therefore we need to use the UI logging to avoid panic.
2022-05-13 09:28:43 +02:00
Eng Zer Jun
fca4ee8e05 test: use T.TempDir to create temporary test directory (#12853)
* test: use `T.TempDir` to create temporary test directory

This commit replaces `ioutil.TempDir` with `t.TempDir` in tests. The
directory created by `t.TempDir` is automatically removed when the test
and all its subtests complete.

Prior to this commit, temporary directory created using `ioutil.TempDir`
needs to be removed manually by calling `os.RemoveAll`, which is omitted
in some tests. The error handling boilerplate e.g.
	defer func() {
		if err := os.RemoveAll(dir); err != nil {
			t.Fatal(err)
		}
	}
is also tedious, but `t.TempDir` handles this for us nicely.

Reference: https://pkg.go.dev/testing#T.TempDir
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>

* test: fix TestLogmon_Start_restart on Windows

Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>

* test: fix failing TestConsul_Integration

t.TempDir fails to perform the cleanup properly because the folder is
still in use

testing.go:967: TempDir RemoveAll cleanup: unlinkat /tmp/TestConsul_Integration2837567823/002/191a6f1a-5371-cf7c-da38-220fe85d10e5/web/secrets: device or resource busy

Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
2022-05-12 11:42:40 -04:00