In #23838 we updated the `Node.Update` RPC handler we use for heartbeats to be
more strict about requiring node secrets. But when a node goes down, it's the
leader that sends the request to mark the node down via `Node.Update` (to
itself), and this request was missing the leader ACL needed to authenticate to
itself.
Add the leader ACL to the request and update the RPC handler test for
disconnected-clients to use ACLs, which would have detected this bug. Also add
a note to the `Authenticate` comment about how that authentication path requires
the leader ACL.
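Roughly, the fix looks like this (a minimal sketch; the request type, field names, and `getLeaderAcl` helper are assumptions based on the surrounding server code):

```go
// Sketch: when marking a node down, the leader now authenticates its own
// request with the leader ACL token (names here are assumptions).
req := &structs.NodeUpdateStatusRequest{
	NodeID: nodeID,
	Status: structs.NodeStatusDown,
	WriteRequest: structs.WriteRequest{
		Region: s.config.Region,
		// Without this token the stricter handler rejects the
		// leader's own request.
		AuthToken: s.getLeaderAcl(),
	},
}
var resp structs.NodeUpdateResponse
err := s.RPC("Node.UpdateStatus", req, &resp)
```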
Fixes: https://github.com/hashicorp/nomad/issues/24231
Ref: https://hashicorp.atlassian.net/browse/NET-11384
When an allocation can't be placed because of a port collision, the
resulting blocked eval is expected to have a metric reporting the port
that caused the conflict, but this metric was not being emitted when
preemption was enabled.
OIDC mandates support of the RS256 signing algorithm, so in order to maximize workload identity's usefulness this change switches from the EdDSA signing algorithm to RS256.
Old keys will continue to use EdDSA but new keys will use RS256. The EdDSA generation code was left in place because it's fast and cheap and I'm not going to lie I hope we get to use it again.
**Test Updates**
Most of our Variables and Keyring tests had a subtle assumption that the keyring would be initialized by the time the test server had elected a leader. ed25519 key generation is so fast that its happening asynchronously with server startup didn't seem to cause problems. Sadly, RSA key generation is so slow that basically all of these tests failed.
I added a new `testutil.WaitForKeyring` helper to replace `testutil.WaitForLeader` in cases where the keyring must be initialized before the test may continue. However, this is mostly used in the `nomad/` package.
In the `api` and `command/agent` packages I decided to switch their helpers to wait for keyring initialization by default. This will slow down tests a bit, but allow those packages to not be as concerned with subtle server readiness details. On my machine RSA key generation takes 63ms, so hopefully the difference isn't significant on CI runners.
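A test in the `nomad/` package now looks roughly like this (a sketch; the helper's signature mirrors `testutil.WaitForLeader`, and the test-server setup is elided):

```go
func TestVariables_Endpoint(t *testing.T) {
	srv, cleanup := TestServer(t, nil)
	defer cleanup()

	// Leader election no longer implies an initialized keyring: RSA key
	// generation is slow enough that it may still be running. Wait for
	// the keyring explicitly instead of just the leader.
	testutil.WaitForKeyring(t, srv.RPC, "global")

	// ... test body that exercises Variables ...
}
```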
**TODO**
- Docs and changelog entries.
- Upgrades - right now upgrades won't get RS256 keys until their root key rotates either manually or after ~30 days.
- Observability - I'm not sure there's a way for operators to see if they're using EdDSA or RS256 unless they inspect a key. The JWKS endpoint can be inspected to see if EdDSA will be used for new identities, but it doesn't technically define which key is active. If upgrades can be fixed to automatically rotate keys, we probably don't need to worry about this.
**Requiem for ed25519**
When workload identities were first implemented we did not immediately consider OIDC compliance. Consul, Vault, and many other third parties support JWT auth methods without full OIDC compliance. For the machine<-->machine use cases workload identity is intended to fulfill, OIDC seemed like a bigger risk than an asset.
EdDSA/ed25519 is the signing algorithm we chose for workload identity JWTs because of all these lovely properties:
1. Deterministic keys that can be derived from our preexisting root keys. This was perhaps the biggest factor since we already had a root encryption key around from which we could derive a signing key (see the sketch after this list).
2. Wonderfully compact: 64 byte private key, 32 byte public key, 64 byte signatures. Just glorious.
3. No parameters. No choices of encodings. It's all well-defined by [RFC 8032](https://datatracker.ietf.org/doc/html/rfc8032).
4. Fastest performing signing algorithm! We don't even care that much about the performance of our chosen algorithm, but what a free bonus!
5. Arguably one of the most secure signing algorithms widely available. Not just from a cryptanalysis perspective, but from an API and usage perspective too.
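To illustrate property 1, here's a sketch of deterministic derivation; the KDF and info string are placeholders, not Nomad's exact scheme:

```go
import (
	"crypto/ed25519"
	"crypto/sha256"
	"io"

	"golang.org/x/crypto/hkdf"
)

// deriveSigningKey deterministically derives an ed25519 signing key from
// preexisting root key material, so no separate signing key needs storing.
func deriveSigningKey(rootKey []byte) (ed25519.PrivateKey, error) {
	seed := make([]byte, ed25519.SeedSize)
	kdf := hkdf.New(sha256.New, rootKey, nil, []byte("jwt-signing"))
	if _, err := io.ReadFull(kdf, seed); err != nil {
		return nil, err
	}
	return ed25519.NewKeyFromSeed(seed), nil
}
```

RSA keys can't be derived this cheaply, which is exactly why the keyring now has to generate and store them.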
Life was good with ed25519, but sadly it could not last.
[IDPs](https://en.wikipedia.org/wiki/Identity_provider), such as AWS's IAM OIDC Provider, love OIDC. They have OIDC implemented for humans, so why not reuse that OIDC support for machines as well? Since OIDC mandates RS256, many implementations don't bother implementing other signing algorithms (or at least not advertising their support). A quick survey of OIDC Discovery endpoints revealed only 2 out of 10 OIDC providers advertised support for anything other than RS256:
- [PayPal](https://www.paypalobjects.com/.well-known/openid-configuration) supports HS256
- [Yahoo](https://api.login.yahoo.com/.well-known/openid-configuration) supports ES256
RS256 only:
- [GitHub](https://token.actions.githubusercontent.com/.well-known/openid-configuration)
- [GitLab](https://gitlab.com/.well-known/openid-configuration)
- [Google](https://accounts.google.com/.well-known/openid-configuration)
- [Intuit](https://developer.api.intuit.com/.well-known/openid_configuration)
- [Microsoft](https://login.microsoftonline.com/fabrikamb2c.onmicrosoft.com/v2.0/.well-known/openid-configuration)
- [SalesForce](https://login.salesforce.com/.well-known/openid-configuration)
- [SimpleLogin (acquired by ProtonMail)](https://app.simplelogin.io/.well-known/openid-configuration/)
- [TFC](https://app.terraform.io/.well-known/openid-configuration)
This avoids leaking task resources (e.g. containers, iptables) if
allocRunner prerun fails during restore on client restart.
Now if prerun fails, TaskRunner.MarkFailedKill() will only emit an event,
mark the task as failed, and cancel the tr's killCtx, so that
ar.runTasks() -> tr.Run() can take care of the actual cleanup.
Removed from (formerly) tr.MarkFailedDead(), now handled by tr.Run():
* set task state as dead
* save task runner local state
* task stop hooks
Also done in tr.Run() now that it's not skipped:
* handleKill() to kill tasks while respecting their shutdown delay, and
  retrying as needed
  * also includes task preKill hooks
* clearDriverHandle() to destroy the task and associated resources
* task exited hooks
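Put together, the slimmed-down method is roughly (a sketch; the event constant and field names are assumptions):

```go
// MarkFailedKill only records the failure and cancels the kill context;
// the actual cleanup happens in tr.Run() as part of the normal lifecycle.
func (tr *TaskRunner) MarkFailedKill(reason string) {
	// Emit an event that marks the task as failed.
	event := structs.NewTaskEvent(structs.TaskSetupFailure).
		SetDisplayMessage(reason).
		SetFailsTask()
	tr.EmitEvent(event)

	// Cancel the kill context so tr.Run() proceeds straight to
	// handleKill(), clearDriverHandle(), and the exited/stop hooks.
	tr.killCtxCancel()
}
```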
Fixes #16517
Given a 3 Server cluster with at least 1 Client connected to Follower 1:
If a NodeMeta.{Apply,Read} request for the Client is received by
Follower 1 with `AllowStale = false`, the Follower will forward the
request to the Leader.
The Leader, not being connected to the target Client, will forward the
RPC to Follower 1.
Follower 1, seeing AllowStale=false, will forward the request to the
Leader.
The Leader, not being connected to... well hopefully you get the
picture: an infinite loop occurs.
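One way to break the cycle (a sketch of the idea, not necessarily the exact fix): once the Leader has picked the server that holds the client connection, drop the consistency requirement before forwarding, so the Follower serves the request instead of bouncing it back.

```go
// Sketch: the Leader already satisfied the AllowStale=false requirement,
// so the Follower only needs to relay the RPC to its connected client.
args.QueryOptions.AllowStale = true
return s.forwardServer(srv, "NodeMeta.Read", args, reply)
```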
This change resolves policies for workload identities when calling Client RPCs. Previously only ACL tokens could be used for Client RPCs.
Since the same cache is used for both bearer tokens (ACL and Workload ID), the token cache size was doubled.
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
When a Nomad client that is running an allocation with
`max_client_disconnect` set misses a heartbeat, the Nomad server will
update its status to `disconnected`.
Upon reconnecting, the client will make three main RPC calls:
- `Node.UpdateStatus` is used to set the client status to `ready`.
- `Node.UpdateAlloc` is used to update the client-side information about
allocations, such as their `ClientStatus`, task states etc.
- `Node.Register` is used to upsert the entire node information,
including its status.
These calls are made concurrently and are also running in parallel with
the scheduler. Depending on the order they run, the scheduler may end up
with incomplete data when reconciling allocations.
For example, a client disconnects and its replacement allocation cannot
be placed anywhere else, so there's a pending eval waiting for
resources.
When this client comes back the order of events may be:
1. Client calls `Node.UpdateStatus` and is now `ready`.
2. Scheduler reconciles allocations and places the replacement alloc to
the client. The client is now assigned two allocations: the original
alloc that is still `unknown` and the replacement that is `pending`.
3. Client calls `Node.UpdateAlloc` and updates the original alloc to
`running`.
4. Scheduler notices too many allocs and stops the replacement.
This creates unnecessary placements or, in a different order of events,
may leave the job without any allocations running until the whole state
is updated and reconciled.
To avoid problems like this, clients must update _all_ of their relevant
information before they can be considered `ready` and available for
scheduling.
To achieve this goal, the RPC endpoints mentioned above have been
modified to enforce strict steps for nodes reconnecting:
- `Node.Register` does not set the client status anymore.
- `Node.UpdateStatus` sets the reconnecting client to the `initializing`
status until it successfully calls `Node.UpdateAlloc`.
These changes are done server-side to avoid the need for additional
coordination between clients and servers. Clients are kept oblivious of
these changes and will keep making these calls as they normally would.
The verification of whether allocations have been updated is done by
storing and comparing the Raft index of the last time the client missed
a heartbeat and the last time it updated its allocations.
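The resulting check is conceptually simple (a sketch; the field names are assumptions based on the description above):

```go
// A reconnecting node stays in `initializing` until its alloc updates are
// at least as recent as its last missed heartbeat.
func allocsUpdated(node *structs.Node) bool {
	return node.LastAllocUpdateIndex >= node.LastMissedHeartbeatIndex
}
```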
* scheduler: stopped-yet-running allocs are still running
* scheduler: test new stopped-but-running logic
* test: assert nonoverlapping alloc behavior
Also add a simpler Wait test helper to improve line numbers and save a few
lines of code.
* docs: tried my best to describe #10446
it's not concise... feedback welcome
* scheduler: fix test that allowed overlapping allocs
* devices: only free devices when ClientStatus is terminal
* test: output nicer failure message if err==nil
Co-authored-by: Mahmood Ali <mahmood@hashicorp.com>
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
* Include region and namespace in CLI output
* Add region and prefix matching for server members
* Add namespace and region API outputs to cluster metadata folder
* Add region awareness to WaitForClient helper function
* Add helper functions for SliceStringHasPrefix and StringHasPrefixInSlice
* Refactor test client agent generation
* Add tests for region
* Add changelog
Glint pulled in an updated version of mitchellh/go-testing-interface,
which broke some existing tests because the update added a Parallel()
method to testing.T. This switches to the standard library testing.TB
which doesn't have a Parallel() method.
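For illustration, a hypothetical helper written against the interface:

```go
import "testing"

// waitFor accepts testing.TB, which both *testing.T and *testing.B satisfy
// but which deliberately omits T-only methods like Parallel(), so helpers
// can't accidentally depend on them.
func waitFor(t testing.TB, check func() error) {
	t.Helper()
	if err := check(); err != nil {
		t.Fatalf("condition never met: %v", err)
	}
}
```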
* operator debug - add client node filtering arguments
* add WaitForClient helper function
* use RPC in WaitForClient to avoid unnecessary imports
* guard against nil values
* move initialization up and shorten test duration
* cleanup nodeLookupFailCount logic
* only display max node notice if we actually tried to capture nodes
* add goroutine text profiles to nomad operator debug
* add server-id=all to nomad operator debug
* fix bug from changing metrics from string to []byte
* Add function to return MetricsSummary struct, metrics gotemplate support
* fix bug resolving 'server-id=all' when no servers are available
* add url to operator_debug tests
* removed test section which is used for future operator_debug.go changes
* separate metrics from operator, use only structs from go-metrics
* ensure parent directories are created as needed
* add suggested comments for text debug pprof
* move check down to where it is used
* add WaitForFiles helper function to wait for multiple files to exist
* compact metrics check
Co-authored-by: Drew Bailey <2614075+drewbailey@users.noreply.github.com>
* fix github's silly apply suggestion
Co-authored-by: Drew Bailey <2614075+drewbailey@users.noreply.github.com>
Reverts d5c7d6e491.
We actually need to forward the request to ensure that the leader is
properly configured and that establishedLeadership completes.
Fix a bug where a malicious user can access or manipulate an alloc in a
namespace they don't have access to. The allocation endpoints perform
ACL checks against the request namespace, not the allocation namespace,
and perform the allocation lookup independently of namespaces.
Here, we check that the requester can access the alloc namespace
regardless of the declared request namespace.
Ideally, we'd enforce that the declared request namespace matches
the actual allocation namespace. Unfortunately, we haven't documented
alloc endpoints as namespaced functions; we suspect starting to enforce
this would be very disruptive and inappropriate for a Nomad point
release. As such, we maintain the current behavior, which doesn't require
passing the proper namespace in the request. A future major release may
start enforcing the declared-namespace check.
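The check now looks roughly like this (a sketch; the handler wiring and the chosen capability are assumptions):

```go
// Look the alloc up first, then authorize against its own namespace,
// regardless of the namespace declared on the request.
alloc, err := state.AllocByID(ws, args.AllocID)
if err != nil {
	return err
}
if alloc == nil {
	return structs.NewErrUnknownAllocation(args.AllocID)
}
if !aclObj.AllowNsOp(alloc.Namespace, acl.NamespaceCapabilityReadJob) {
	return structs.ErrPermissionDenied
}
```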
WaitForRunning risks a race condition where the allocation succeeds and
completes before WaitForRunning is called (or while it is running).
Here, I made the behavior match the function documentation.
I considered making it stricter, but callers need to account for the
allocation terminating immediately after WaitForRunning returns
anyway.
The really exciting change, though, is making WaitForRunning return the
allocations that it started. This should cut down test boilerplate
significantly.
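Usage ends up looking like this (signature assumed):

```go
// WaitForRunning now hands back the allocations it waited on, so tests
// don't need a second query to find them.
allocs := e2eutil.WaitForRunning(t, nomadClient, jobID)
alloc := allocs[0] // e.g. inspect or exec into the first alloc
```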
Not setting the host name led the Go HTTP client to expect a certificate
with a DNS-resolvable name. Since Nomad uses `${role}.${region}.nomad`
names, ephemeral dir migrations were broken when TLS was enabled.
Added an e2e test to ensure this doesn't break again as it's very
difficult to test and the TLS configuration is very easy to get wrong.
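The fix amounts to telling the TLS client which certificate name to expect (a sketch; the actual wiring lives in the migration client):

```go
import (
	"crypto/tls"
	"net/http"
)

// Nomad certificates use ${role}.${region}.nomad instead of resolvable
// DNS names, so set ServerName explicitly for verification.
tlsConf := &tls.Config{
	RootCAs:    caCertPool, // CA pool setup elided
	ServerName: "client.global.nomad", // e.g. role=client, region=global
}
client := &http.Client{
	Transport: &http.Transport{TLSClientConfig: tlsConf},
}
```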
This drops the testing stdlib pkg from our dependencies. Saves a
whopping 46kb on our binary (was really hoping for more of a win there),
but also avoids potential ugliness with how testing sets flags.