Commit Graph

57 Commits

Author SHA1 Message Date
Tim Gross
5d09d7ad07 reset max query time of blocking queries in client after retries (#25039)
When a blocking query on the client hits a retryable error, we change the max
query time so that it falls within the `RPCHoldTimeout` timeout. But when the
retry succeeds we don't reset it to the original value.

Because the calls to `Node.GetClientAllocs` reuse the same request struct
instead of reallocating it, any retry will cause the agent to poll at a faster
frequency until the agent restarts. No other current RPC on the client has this
behavior, but we'll fix this in the `rpc` method rather than in the caller so
that any future users of the `rpc` method don't have to remember this detail.

Fixes: https://github.com/hashicorp/nomad/issues/25033
2025-02-07 08:45:56 -05:00
Matt Keeler
833e240597 Upgrade to using hashicorp/go-metrics@v0.5.4 (#24856)
* Upgrade to using hashicorp/go-metrics@v0.5.4

This also requires bumping the dependencies for:

* memberlist
* serf
* raft
* raft-boltdb
* (and indirectly hashicorp/mdns due to the memberlist or serf update)

Unlike some other HashiCorp products, Nomads root module is currently expected to be consumed by others. This means that it needs to be treated more like our libraries and upgrade to hashicorp/go-metrics by utilizing its compat packages. This allows those importing the root module to control the metrics module used via build tags.
2025-01-31 15:22:00 -05:00
Tim Gross
6a3803c31e dynamic host volumes: RPC handlers (#24373)
This changeset implements the RPC handlers for Dynamic Host Volumes, including
the plumbing needed to forward requests to clients. The client-side
implementation is stubbed and will be done under a separate PR.

Ref: https://hashicorp.atlassian.net/browse/NET-11549
2024-12-19 09:25:54 -05:00
Tim Gross
30e57c39b0 discovery: correctly handle IPv6 addresses from go-discover (#24649)
Nomad sets a default port when resolving server addresses that don't have
one. When we get a "bare" IPv6 address without a port, we end up with an
unexpected error "too many colons in address" when we try to split the address
and host, because the standard library function expects IPv6 addresses to be
wrapped in brackets as recommended by RFC5952. User-configured addresses avoid
this problem by accepting IP address and port as separate configuration values,
but go-discover emits "bare" IPv6 addresses without a port in IPv6 environments.

Fix this by adding brackets to IPv6 addresses when we get the "too many colons"
error from the stdlib. This will still give erroneous results if the address
includes the port but is missing brackets, but there's no way to unambiguously
parse that address.

Ref: https://www.rfc-editor.org/rfc/rfc5952
Fixes: https://github.com/hashicorp/nomad/issues/24608
2024-12-17 15:49:40 -05:00
Tim Gross
8eaf176868 client: fix IPv6 parsing for client.servers block (#20324)
When the `client.servers` block is parsed, we split the port from the
address. This does not correctly handle IPv6 addresses when they are in URL
format (wrapped in brackets), which we require to disambiguate the port and
address.

Fix the parser to correctly split out the port and handle a missing port value
for IPv6. Update the documentation to make the URL format requirement clear.

Fixes: https://github.com/hashicorp/nomad/issues/20310
2024-04-08 15:06:27 -04:00
Michael Schurter
23e4b7c9d2 Upgrade go-msgpack to v2 (#20173)
Replaces #18812

Upgraded with:
```
find . -name '*.go' -exec sed -i s/"github.com\/hashicorp\/go-msgpack\/codec"/"github.com\/hashicorp\/go-msgpack\/v2\/codec/" '{}' ';'
find . -name '*.go' -exec sed -i s/"github.com\/hashicorp\/net-rpc-msgpackrpc"/"github.com\/hashicorp\/net-rpc-msgpackrpc\/v2/" '{}' ';'
go get
go get -v -u github.com/hashicorp/raft-boltdb/v2
go get -v github.com/hashicorp/serf@5d32001edfaa18d1c010af65db707cdb38141e80
```

see https://github.com/hashicorp/go-msgpack/releases/tag/v2.1.0
for details
2024-03-21 11:44:23 -07:00
Luiz Aoqui
099ee06a60 Revert "deps: update go-metrics to v0.5.3 (#19190)" (#19374)
* Revert "deps: update go-metrics to v0.5.3 (#19190)"

This reverts commit ddb060d8b3.

* changelog: add entry for #19374
2023-12-08 08:46:55 -05:00
Luiz Aoqui
ddb060d8b3 deps: update go-metrics to v0.5.3 (#19190)
Update `go-metrics` to v0.5.3 to pick
https://github.com/hashicorp/go-metrics/pull/146.
2023-11-28 12:37:57 -05:00
hashicorp-copywrite[bot]
2d35e32ec9 Update copyright file headers to BUSL-1.1 2023-08-10 17:27:15 -05:00
Tim Gross
deae9bb62e client: send node secret with every client-to-server RPC (#16799)
In Nomad 1.5.3 we fixed a security bug that allowed bypass of ACL checks if the
request came thru a client node first. But this fix broke (knowingly) the
identification of many client-to-server RPCs. These will be now measured as if
they were anonymous. The reason for this is that many client-to-server RPCs do
not send the node secret and instead rely on the protection of mTLS.

This changeset ensures that the node secret is being sent with every
client-to-server RPC request. In a future version of Nomad we can add
enforcement on the server side, but this was left out of this changeset to
reduce risks to the safe upgrade path.

Sending the node secret as an auth token introduces a new problem during initial
introduction of a client. Clients send many RPCs concurrently with
`Node.Register`, but until the node is registered the node secret is unknown to
the server and will be rejected as invalid. This causes permission denied
errors.

To fix that, this changeset introduces a gate on having successfully made a
`Node.Register` RPC before any other RPCs can be sent (except for `Status.Ping`,
which we need earlier but which also ignores the error because that handler
doesn't do an authorization check). This ensures that we only send requests with
a node secret already known to the server. This also makes client startup a
little easier to reason about because we know `Node.Register` must succeed
first, and it should make for a good place to hook in future plans for secure
introduction of nodes. The tradeoff is that an existing client that has running
allocs will take slightly longer (a second or two) to transition to ready after
a restart, because the transition in `Node.UpdateStatus` is gated at the server
by first submitting `Node.UpdateAlloc` with client alloc updates.
2023-06-22 11:06:49 -04:00
Luiz Aoqui
ee5a08dbb2 Revert "hashicorp/go-msgpack v2 (#16810)" (#17047)
This reverts commit 8a98520d56.
2023-05-01 17:18:34 -04:00
Ian Fijolek
8a98520d56 hashicorp/go-msgpack v2 (#16810)
* Upgrade from hashicorp/go-msgpack v1.1.5 to v2.1.0

Fixes #16808

* Update hashicorp/net-rpc-msgpackrpc to v2 to match go-msgpack

* deps: use go-msgpack v2.0.0

go-msgpack v2.1.0 includes some code changes that we will need to
investigate furthere to assess its impact on Nomad, so keeping this
dependency on v2.0.0 for now since it's no-op.

---------

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2023-04-17 17:02:05 -04:00
hashicorp-copywrite[bot]
f005448366 [COMPLIANCE] Add Copyright and License Headers 2023-04-10 15:36:59 +00:00
Michael Schurter
6809b0b527 Dynamic Node Metadata (#15844)
Fixes #14617
Dynamic Node Metadata allows Nomad users, and their jobs, to update Node metadata through an API. Currently Node metadata is only reloaded when a Client agent is restarted.

Includes new UI for editing metadata as well.

---------

Co-authored-by: Phil Renaud <phil.renaud@hashicorp.com>
2023-02-07 14:42:25 -08:00
Tim Gross
b79d00abf3 implement pre-forwarding auth on select RPCs (#15513)
In #15417 we added a new `Authenticate` method to the server that returns an
`AuthenticatedIdentity` struct. This changeset implements this method for a
small number of RPC endpoints that together represent all the various ways in
which RPCs are sent, so that we can validate that we're happy with this
approach.
2023-01-24 10:52:07 -05:00
Luiz Aoqui
f33bb5ec4e client: retry RPC call when no server is available (#15140)
When a Nomad service starts it tries to establish a connection with
servers, but it also runs alloc runners to manage whatever allocations
it needs to run.

The alloc runner will invoke several hooks to perform actions, with some
of them requiring access to the Nomad servers, such as Native Service
Discovery Registration.

If the alloc runner starts before a connection is established the alloc
runner will fail, causing the allocation to be shutdown. This is
particularly problematic for disconnected allocations that are
reconnecting, as they may fail as soon as the client reconnects.

This commit changes the RPC request logic to retry it, using the
existing retry mechanism, if there are no servers available.
2022-11-04 14:09:39 -04:00
Michael Schurter
01648e615a client: fix data races in config handling (#14139)
Before this change, Client had 2 copies of the config object: config and configCopy. There was no guidance around which to use where (other than configCopy's comment to pass it to alloc runners), both are shared among goroutines and mutated in data racy ways. At least at one point I think the idea was to have `config` be mutable and then grab a lock to overwrite `configCopy`'s pointer atomically. This would have allowed alloc runners to read their config copies in data race safe ways, but this isn't how the current implementation worked.

This change takes the following approach to safely handling configs in the client:

1. `Client.config` is the only copy of the config and all access must go through the `Client.configLock` mutex
2. Since the mutex *only protects the config pointer itself and not fields inside the Config struct:* all config mutation must be done on a *copy* of the config, and then Client's config pointer is overwritten while the mutex is acquired. Alloc runners and other goroutines with the old config pointer will not see config updates.
3. Deep copying is implemented on the Config struct to satisfy the previous approach. The TLS Keyloader is an exception because it has its own internal locking to support mutating in place. An unfortunate complication but one I couldn't find a way to untangle in a timely fashion.
4. To facilitate deep copying I made an *internally backward incompatible API change:* our `helper/funcs` used to turn containers (slices and maps) with 0 elements into nils. This probably saves a few memory allocations but makes it very easy to cause panics. Since my new config handling approach uses more copying, it became very difficult to ensure all code that used containers on configs could handle nils properly. Since this code has caused panics in the past, I fixed it: nil containers are copied as nil, but 0-element containers properly return a new 0-element container. No more "downgrading to nil!"
2022-08-18 16:32:04 -07:00
Yoan Blanc
148fe61cc5 fix: use NewSafeTimer
Signed-off-by: Yoan Blanc <yoan@dosimple.ch>
2022-04-11 19:37:14 +02:00
Yoan Blanc
bda7b1ece0 feat: remove dependency to consul/lib
Signed-off-by: Yoan Blanc <yoan@dosimple.ch>
2022-04-09 13:22:44 +02:00
Michael Schurter
2411d3afd2 core: remove all traces of unused protocol version
Nomad inherited protocol version numbering configuration from Consul and
Serf, but unlike those projects Nomad has never used it. Nomad's
`protocol_version` has always been `1`.

While the code is effectively unused and therefore poses no runtime
risks to leave, I felt like removing it was best because:

1. Nomad's RPC subsystem has been able to evolve extensively without
   needing to increment the version number.
2. Nomad's HTTP API has evolved extensively without increment
   `API{Major,Minor}Version`. If we want to version the HTTP API in the
   future, I doubt this is the mechanism we would choose.
3. The presence of the `server.protocol_version` configuration
   parameter is confusing since `server.raft_protocol` *is* an important
   parameter for operators to consider. Even more confusing is that
   there is a distinct Serf protocol version which is included in `nomad
   server members` output under the heading `Protocol`. `raft_protocol`
   is the *only* protocol version relevant to Nomad developers and
   operators. The other protocol versions are either deadcode or have
   never changed (Serf).
4. If we were to need to version the RPC, HTTP API, or Serf protocols, I
   don't think these configuration parameters and variables are the best
   choice. If we come to that point we should choose a versioning scheme
   based on the use case and modern best practices -- not this 6+ year
   old dead code.
2022-02-18 16:12:36 -08:00
Mahmood Ali
3f7a5c1474 pool: track usage of incoming streams (#10710)
Track usage of incoming streams on a connection. Connections without
reference counts get marked as unused and reaped in a periodic job.

This fixes a bug where `alloc exec` and `alloc fs` sessions get terminated
unexpectedly. Previously, when a client heartbeats switches between
servers, the pool connection reaper eventually identifies the connection
as unused and closes it even if it has an active exec/fs sessions.

Fixes #10579
2021-06-07 10:22:37 -04:00
Tim Gross
bb194cb91d test infrastructure for mock client RPCs (#10193)
This commit includes a new test client that allows overriding the RPC
protocols. Only the RPCs that are passed in are registered, which lets you
implement a mock RPC in the server tests. This commit includes an example of
this for the ClientCSI RPC server.
2021-03-31 16:37:09 -04:00
Kris Hicks
85ed8ddd4f Add gosimple linter (#9590) 2020-12-09 11:05:18 -08:00
Benjamin Buzbee
6a6547b0b6 Fix RPC retry logic in nomad client's rpc.go for blocking queries (#9266) 2020-11-30 15:11:10 -05:00
Pierre Cauchois
ca1b85b36d RPC Timeout/Retries account for blocking requests (#8921)
The current implementation measures RPC request timeout only against
config.RPCHoldTimeout, which is fine for non-blocking requests but will
almost surely be exceeded by long-poll requests that block for minutes
at a time.

This adds an HasTimedOut method on the RPCInfo interface that takes into
account whether the request is blocking, its maximum wait time, and the
RPCHoldTimeout.
2020-09-18 08:58:41 -04:00
Tim Gross
414caf76e5 CSI: move node unmount to server-driven RPCs (#7596)
If a volume-claiming alloc stops and the CSI Node plugin that serves
that alloc's volumes is missing, there's no way for the allocrunner
hook to send the `NodeUnpublish` and `NodeUnstage` RPCs.

This changeset addresses this issue with a redesign of the client-side
for CSI. Rather than unmounting in the alloc runner hook, the alloc
runner hook will simply exit. When the server gets the
`Node.UpdateAlloc` for the terminal allocation that had a volume claim,
it creates a volume claim GC job. This job will made client RPCs to a
new node plugin RPC endpoint, and only once that succeeds, move on to
making the client RPCs to the controller plugin. If the node plugin is
unavailable, the GC job will fail and be requeued.
2020-04-02 16:04:56 -04:00
Yoan Blanc
c3928fe360 fixup! vendor: explicit use of hashicorp/go-msgpack
Signed-off-by: Yoan Blanc <yoan@dosimple.ch>
2020-03-31 09:48:07 -04:00
Yoan Blanc
887f23a351 vendor: explicit use of hashicorp/go-msgpack
Signed-off-by: Yoan Blanc <yoan@dosimple.ch>
2020-03-31 09:45:21 -04:00
Danielle Lancashire
2f72b480c2 client: Rename ClientCSI -> CSIController 2020-03-23 13:58:30 -04:00
Danielle Lancashire
778a32de4a client: Setup CSI RPC Endpoint
This commit introduces a new set of endpoints to a Nomad Client:
ClientCSI.

ClientCSI is responsible for mediating requests from a Nomad Server to
a CSI Plugin running on a Nomad Client. It should only really be used to
make controller RPCs.
2020-03-23 13:58:28 -04:00
Drew Bailey
240c0ee0ec agent pprof endpoints
wip, agent endpoint and client endpoint for pprof profiles

agent endpoint test
2020-01-09 15:15:02 -05:00
Drew Bailey
6bf8617d02 rename function, initialize log level better
underscores instead of dashes for query params
2019-11-05 09:51:53 -05:00
Drew Bailey
1176fc0227 address feedback, use agent_endpoint instead of monitor 2019-11-05 09:51:53 -05:00
Drew Bailey
890b8a43fb get local rpc endpoint working 2019-11-05 09:51:48 -05:00
Drew Bailey
12819975ee remove log_writer
prefix output with proper spacing

update gzip handler, adjust first byte flow to allow gzip handler bypass

wip, first stab at wiring up rpc endpoint
2019-11-05 09:51:48 -05:00
Mahmood Ali
979a6a1778 implement client endpoint of nomad exec
Add a client streaming RPC endpoint for processing nomad exec tasks, by invoking
the relevant task handler for execution.
2019-05-09 16:49:08 -04:00
Mahmood Ali
7840ab51e2 fix more cases of logging arity errors 2019-01-09 09:22:47 -05:00
Michael Schurter
4b236cc4ce client.rpc: don't log errors on shutdown 2018-11-19 16:39:30 -08:00
Michael Schurter
9da25adc54 client: hclog-ify most of the client
Leaving fingerprinters in case that interface changes with plugins.
2018-10-16 16:53:30 -07:00
Alex Dadgar
12ec2e3b60 RPC Retry Watcher 2018-04-03 18:05:28 -07:00
Alex Dadgar
f062c93e95 Client tls 2018-02-15 15:22:57 -08:00
Alex Dadgar
ce37deebf4 client implementation of alloc gc and stats 2018-02-15 13:59:02 -08:00
Alex Dadgar
a80ef65291 Code review feedback 2018-02-15 13:59:02 -08:00
Alex Dadgar
0706f4e86f Add Streaming RPC ack
This PR introduces an ack allowing the receiving end of the streaming
RPC to return any error that may have occured during the establishment
of the streaming RPC.
2018-02-15 13:59:02 -08:00
Alex Dadgar
3c689ba97f Client Stat/List impl 2018-02-15 13:59:02 -08:00
Alex Dadgar
7e5a30d5ae Agent logs 2018-02-15 13:59:02 -08:00
Alex Dadgar
c76b311d06 client tests 2018-02-15 13:59:02 -08:00
Alex Dadgar
5e7a1a44a2 Logs over RPC w/ lots to touch up 2018-02-15 13:59:01 -08:00
Alex Dadgar
c0e01d8a9e New RPC Modes and basic setup for streaming RPC handlers 2018-02-15 13:59:01 -08:00
Alex Dadgar
496ab5f654 Remove circular dependency 2018-02-15 13:59:01 -08:00