In #23838 we updated the `Node.Update` RPC handler we use for heartbeats to be
more strict about requiring node secrets. But when a node goes down, it's the
leader that sends the request to mark the node down via `Node.Update` (to
itself), and this request was missing the leader ACL needed to authenticate to
itself.
Add the leader ACL to the request and update the RPC handler test for
disconnected-clients to use ACLs, which would have detected this bug. Also add
a note to the `Authenticate` comment about how that authentication path requires
the leader ACL.
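Roughly, the fix looks like this (a minimal sketch; the request type, field names, and `getLeaderAcl` helper are assumptions based on the surrounding server code):

```go
// Sketch: when marking a node down, the leader now authenticates its own
// request with the leader ACL token (names here are assumptions).
req := &structs.NodeUpdateStatusRequest{
	NodeID: nodeID,
	Status: structs.NodeStatusDown,
	WriteRequest: structs.WriteRequest{
		Region: s.config.Region,
		// Without this token the stricter handler rejects the
		// leader's own request.
		AuthToken: s.getLeaderAcl(),
	},
}
var resp structs.NodeUpdateResponse
err := s.RPC("Node.UpdateStatus", req, &resp)
```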
Fixes: https://github.com/hashicorp/nomad/issues/24231
Ref: https://hashicorp.atlassian.net/browse/NET-11384
When an allocation can't be placed because of a port collision, the
resulting blocked eval is expected to have a metric reporting the port
that caused the conflict, but this metric was not being emitted when
preemption was enabled.
OIDC mandates support of the RS256 signing algorithm, so in order to maximize workload identity's usefulness this change switches from the EdDSA signing algorithm to RS256.
Old keys will continue to use EdDSA but new keys will use RS256. The EdDSA generation code was left in place because it's fast and cheap and I'm not going to lie I hope we get to use it again.
**Test Updates**
Most of our Variables and Keyring tests had a subtle assumption that the keyring would be initialized by the time the test server had elected a leader. ed25519 key generation is so fast that its happening asynchronously with server startup didn't seem to cause problems. Sadly, RSA key generation is so slow that basically all of these tests failed.
I added a new `testutil.WaitForKeyring` helper to replace `testutil.WaitForLeader` in cases where the keyring must be initialized before the test may continue. However, this is mostly used in the `nomad/` package.
In the `api` and `command/agent` packages I decided to switch their helpers to wait for keyring initialization by default. This will slow down tests a bit, but allow those packages to not be as concerned with subtle server readiness details. On my machine RSA key generation takes 63ms, so hopefully the difference isn't significant on CI runners.
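A test in the `nomad/` package now looks roughly like this (a sketch; the helper's signature mirrors `testutil.WaitForLeader`, and the test-server setup is elided):

```go
func TestVariables_Endpoint(t *testing.T) {
	srv, cleanup := TestServer(t, nil)
	defer cleanup()

	// Leader election no longer implies an initialized keyring: RSA key
	// generation is slow enough that it may still be running. Wait for
	// the keyring explicitly instead of just the leader.
	testutil.WaitForKeyring(t, srv.RPC, "global")

	// ... test body that exercises Variables ...
}
```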
**TODO**
- Docs and changelog entries.
- Upgrades - right now upgrades won't get RS256 keys until their root key rotates either manually or after ~30 days.
- Observability - I'm not sure there's a way for operators to see if they're using EdDSA or RS256 unless they inspect a key. The JWKS endpoint can be inspected to see if EdDSA will be used for new identities, but it doesn't technically define which key is active. If upgrades can be fixed to automatically rotate keys, we probably don't need to worry about this.
**Requiem for ed25519**
When workload identities were first implemented we did not immediately consider OIDC compliance. Consul, Vault, and many other third parties support JWT auth methods without full OIDC compliance. For the machine<-->machine use cases workload identity is intended to fulfill, OIDC seemed like a bigger risk than an asset.
EdDSA/ed25519 is the signing algorithm we chose for workload identity JWTs because of all these lovely properties:
1. Deterministic keys that can be derived from our preexisting root keys. This was perhaps the biggest factor since we already had a root encryption key around from which we could derive a signing key (see the sketch after this list).
2. Wonderfully compact: 64 byte private key, 32 byte public key, 64 byte signatures. Just glorious.
3. No parameters. No choices of encodings. It's all well-defined by [RFC 8032](https://datatracker.ietf.org/doc/html/rfc8032).
4. Fastest performing signing algorithm! We don't even care that much about the performance of our chosen algorithm, but what a free bonus!
5. Arguably one of the most secure signing algorithms widely available. Not just from a cryptanalysis perspective, but from an API and usage perspective too.
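To illustrate property 1, here's a sketch of deterministic derivation; the KDF and info string are placeholders, not Nomad's exact scheme:

```go
import (
	"crypto/ed25519"
	"crypto/sha256"
	"io"

	"golang.org/x/crypto/hkdf"
)

// deriveSigningKey deterministically derives an ed25519 signing key from
// preexisting root key material, so no separate signing key needs storing.
func deriveSigningKey(rootKey []byte) (ed25519.PrivateKey, error) {
	seed := make([]byte, ed25519.SeedSize)
	kdf := hkdf.New(sha256.New, rootKey, nil, []byte("jwt-signing"))
	if _, err := io.ReadFull(kdf, seed); err != nil {
		return nil, err
	}
	return ed25519.NewKeyFromSeed(seed), nil
}
```

RSA keys can't be derived this cheaply, which is exactly why the keyring now has to generate and store them.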
Life was good with ed25519, but sadly it could not last.
[IDPs](https://en.wikipedia.org/wiki/Identity_provider), such as AWS's IAM OIDC Provider, love OIDC. They have OIDC implemented for humans, so why not reuse that OIDC support for machines as well? Since OIDC mandates RS256, many implementations don't bother implementing other signing algorithms (or at least not advertising their support). A quick survey of OIDC Discovery endpoints revealed only 2 out of 10 OIDC providers advertised support for anything other than RS256:
- [PayPal](https://www.paypalobjects.com/.well-known/openid-configuration) supports HS256
- [Yahoo](https://api.login.yahoo.com/.well-known/openid-configuration) supports ES256
RS256 only:
- [GitHub](https://token.actions.githubusercontent.com/.well-known/openid-configuration)
- [GitLab](https://gitlab.com/.well-known/openid-configuration)
- [Google](https://accounts.google.com/.well-known/openid-configuration)
- [Intuit](https://developer.api.intuit.com/.well-known/openid_configuration)
- [Microsoft](https://login.microsoftonline.com/fabrikamb2c.onmicrosoft.com/v2.0/.well-known/openid-configuration)
- [SalesForce](https://login.salesforce.com/.well-known/openid-configuration)
- [SimpleLogin (acquired by ProtonMail)](https://app.simplelogin.io/.well-known/openid-configuration/)
- [TFC](https://app.terraform.io/.well-known/openid-configuration)
This avoids leaking task resources (e.g. containers, iptables) if
allocRunner prerun fails during restore on client restart.
Now if prerun fails, TaskRunner.MarkFailedKill() will only emit an event,
mark the task as failed, and cancel the tr's killCtx, so that
ar.runTasks() -> tr.Run() can take care of the actual cleanup.
Removed from (formerly) tr.MarkFailedDead(), now handled by tr.Run():
* set task state as dead
* save task runner local state
* task stop hooks
Also done in tr.Run() now that it's not skipped:
* handleKill() to kill tasks while respecting their shutdown delay, and
  retrying as needed
  * also includes task preKill hooks
* clearDriverHandle() to destroy the task and associated resources
* task exited hooks
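Put together, the slimmed-down method is roughly (a sketch; the event constant and field names are assumptions):

```go
// MarkFailedKill only records the failure and cancels the kill context;
// the actual cleanup happens in tr.Run() as part of the normal lifecycle.
func (tr *TaskRunner) MarkFailedKill(reason string) {
	// Emit an event that marks the task as failed.
	event := structs.NewTaskEvent(structs.TaskSetupFailure).
		SetDisplayMessage(reason).
		SetFailsTask()
	tr.EmitEvent(event)

	// Cancel the kill context so tr.Run() proceeds straight to
	// handleKill(), clearDriverHandle(), and the exited/stop hooks.
	tr.killCtxCancel()
}
```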
Fixes #16517
Given a 3 Server cluster with at least 1 Client connected to Follower 1:
If a NodeMeta.{Apply,Read} request for the Client is received by
Follower 1 with `AllowStale = false`, the Follower will forward the
request to the Leader.
The Leader, not being connected to the target Client, will forward the
RPC to Follower 1.
Follower 1, seeing AllowStale=false, will forward the request to the
Leader.
The Leader, not being connected to... well hopefully you get the
picture: an infinite loop occurs.
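One way to break the cycle (a sketch of the idea, not necessarily the exact fix): once the Leader has picked the server that holds the client connection, drop the consistency requirement before forwarding, so the Follower serves the request instead of bouncing it back.

```go
// Sketch: the Leader already satisfied the AllowStale=false requirement,
// so the Follower only needs to relay the RPC to its connected client.
args.QueryOptions.AllowStale = true
return s.forwardServer(srv, "NodeMeta.Read", args, reply)
```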
This change resolves policies for workload identities when calling Client RPCs. Previously only ACL tokens could be used for Client RPCs.
Since the same cache is used for both bearer tokens (ACL and Workload ID), the token cache size was doubled.
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
When a Nomad client that is running an allocation with
`max_client_disconnect` set misses a heartbeat, the Nomad server will
update its status to `disconnected`.
Upon reconnecting, the client will make three main RPC calls:
- `Node.UpdateStatus` is used to set the client status to `ready`.
- `Node.UpdateAlloc` is used to update the client-side information about
allocations, such as their `ClientStatus`, task states etc.
- `Node.Register` is used to upsert the entire node information,
including its status.
These calls are made concurrently and are also running in parallel with
the scheduler. Depending on the order they run, the scheduler may end up
with incomplete data when reconciling allocations.
For example, a client disconnects and its replacement allocation cannot
be placed anywhere else, so there's a pending eval waiting for
resources.
When this client comes back the order of events may be:
1. Client calls `Node.UpdateStatus` and is now `ready`.
2. Scheduler reconciles allocations and places the replacement alloc to
the client. The client is now assigned two allocations: the original
alloc that is still `unknown` and the replacement that is `pending`.
3. Client calls `Node.UpdateAlloc` and updates the original alloc to
`running`.
4. Scheduler notices too many allocs and stops the replacement.
This creates unnecessary placements or, in a different order of events,
may leave the job without any allocations running until the whole state
is updated and reconciled.
To avoid problems like this, clients must update _all_ of their relevant
information before they can be considered `ready` and available for
scheduling.
To achieve this goal, the RPC endpoints mentioned above have been
modified to enforce strict steps for nodes reconnecting:
- `Node.Register` does not set the client status anymore.
- `Node.UpdateStatus` sets the reconnecting client to the `initializing`
status until it successfully calls `Node.UpdateAlloc`.
These changes are done server-side to avoid the need for additional
coordination between clients and servers. Clients are kept oblivious of
these changes and will keep making these calls as they normally would.
The verification of whether allocations have been updated is done by
storing and comparing the Raft index of the last time the client missed
a heartbeat and the last time it updated its allocations.
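The resulting check is conceptually simple (a sketch; the field names are assumptions based on the description above):

```go
// A reconnecting node stays in `initializing` until its alloc updates are
// at least as recent as its last missed heartbeat.
func allocsUpdated(node *structs.Node) bool {
	return node.LastAllocUpdateIndex >= node.LastMissedHeartbeatIndex
}
```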
* scheduler: stopped-yet-running allocs are still running
* scheduler: test new stopped-but-running logic
* test: assert nonoverlapping alloc behavior
Also add a simpler Wait test helper to improve line numbers and save a few
lines of code.
* docs: tried my best to describe #10446
it's not concise... feedback welcome
* scheduler: fix test that allowed overlapping allocs
* devices: only free devices when ClientStatus is terminal
* test: output nicer failure message if err==nil
Co-authored-by: Mahmood Ali <mahmood@hashicorp.com>
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
* Include region and namespace in CLI output
* Add region and prefix matching for server members
* Add namespace and region API outputs to cluster metadata folder
* Add region awareness to WaitForClient helper function
* Add helper functions for SliceStringHasPrefix and StringHasPrefixInSlice
* Refactor test client agent generation
* Add tests for region
* Add changelog
Glint pulled in an updated version of mitchellh/go-testing-interface,
which broke some existing tests because the update added a Parallel()
method to testing.T. This switches to the standard library testing.TB
which doesn't have a Parallel() method.
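For illustration, a hypothetical helper written against the interface:

```go
import "testing"

// waitFor accepts testing.TB, which both *testing.T and *testing.B satisfy
// but which deliberately omits T-only methods like Parallel(), so helpers
// can't accidentally depend on them.
func waitFor(t testing.TB, check func() error) {
	t.Helper()
	if err := check(); err != nil {
		t.Fatalf("condition never met: %v", err)
	}
}
```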
* operator debug - add client node filtering arguments
* add WaitForClient helper function
* use RPC in WaitForClient to avoid unnecessary imports
* guard against nil values
* move initialization up and shorten test duration
* cleanup nodeLookupFailCount logic
* only display max node notice if we actually tried to capture nodes
* add goroutine text profiles to nomad operator debug
* add server-id=all to nomad operator debug
* fix bug from changing metrics from string to []byte
* Add function to return MetricsSummary struct, metrics gotemplate support
* fix bug resolving 'server-id=all' when no servers are available
* add url to operator_debug tests
* removed test section which is used for future operator_debug.go changes
* separate metrics from operator, use only structs from go-metrics
* ensure parent directories are created as needed
* add suggested comments for text debug pprof
* move check down to where it is used
* add WaitForFiles helper function to wait for multiple files to exist
* compact metrics check
Co-authored-by: Drew Bailey <2614075+drewbailey@users.noreply.github.com>
* fix github's silly apply suggestion
Co-authored-by: Drew Bailey <2614075+drewbailey@users.noreply.github.com>
Reverts d5c7d6e491.
We actually need to forward the request to ensure that the leader is
properly configured and that establishedLeadership completes.
Fix a bug where a malicious user can access or manipulate an alloc in a
namespace they don't have access to. The allocation endpoints perform
ACL checks against the request namespace, not the allocation namespace,
and perform the allocation lookup independently of namespaces.
Here, we check that the requester can access the alloc namespace
regardless of the declared request namespace.
Ideally, we'd enforce that the declared request namespace matches
the actual allocation namespace. Unfortunately, we haven't documented
alloc endpoints as namespaced functions; we suspect starting to enforce
this would be very disruptive and inappropriate for a Nomad point
release. As such, we maintain the current behavior, which doesn't require
passing the proper namespace in the request. A future major release may
start enforcing the declared-namespace check.
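The check now looks roughly like this (a sketch; the handler wiring and the chosen capability are assumptions):

```go
// Look the alloc up first, then authorize against its own namespace,
// regardless of the namespace declared on the request.
alloc, err := state.AllocByID(ws, args.AllocID)
if err != nil {
	return err
}
if alloc == nil {
	return structs.NewErrUnknownAllocation(args.AllocID)
}
if !aclObj.AllowNsOp(alloc.Namespace, acl.NamespaceCapabilityReadJob) {
	return structs.ErrPermissionDenied
}
```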
WaitForRunning risks a race condition where the allocation succeeds and
completes before WaitForRunning is called (or while it is running).
Here, I made the behavior match the function documentation.
I considered making it stricter, but callers need to account for the
allocation terminating immediately after WaitForRunning returns
anyway.
The really exciting change, though, is making WaitForRunning return the
allocations that it started. This should cut down test boilerplate
significantly.
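Usage ends up looking like this (signature assumed):

```go
// WaitForRunning now hands back the allocations it waited on, so tests
// don't need a second query to find them.
allocs := e2eutil.WaitForRunning(t, nomadClient, jobID)
alloc := allocs[0] // e.g. inspect or exec into the first alloc
```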
Not setting the host name led the Go HTTP client to expect a certificate
with a DNS-resolvable name. Since Nomad uses `${role}.${region}.nomad`
names, ephemeral dir migrations were broken when TLS was enabled.
Added an e2e test to ensure this doesn't break again as it's very
difficult to test and the TLS configuration is very easy to get wrong.
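The fix amounts to telling the TLS client which certificate name to expect (a sketch; the actual wiring lives in the migration client):

```go
import (
	"crypto/tls"
	"net/http"
)

// Nomad certificates use ${role}.${region}.nomad instead of resolvable
// DNS names, so set ServerName explicitly for verification.
tlsConf := &tls.Config{
	RootCAs:    caCertPool, // CA pool setup elided
	ServerName: "client.global.nomad", // e.g. role=client, region=global
}
client := &http.Client{
	Transport: &http.Transport{TLSClientConfig: tlsConf},
}
```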
This drops the testing stdlib pkg from our dependencies. Saves a
whopping 46kb on our binary (was really hoping for more of a win there),
but also avoids potential ugliness with how testing sets flags.