Nomad Enterprise users operating in air-gapped or otherwise secured environments
don't want to send license reporting metrics directly from their
servers. Implement manual/offline reporting by periodically recording usage
metrics snapshots in the state store, and providing an API and CLI by which
cluster administrators can download the snapshot for review and out-of-band
transmission to HashiCorp.
This is the CE portion of the work required for implemention in the Enterprise
product. Nomad CE does not perform utilization reporting.
Ref: https://github.com/hashicorp/nomad-enterprise/pull/2673
Ref: https://hashicorp.atlassian.net/browse/NMD-68
Ref: https://go.hashi.co/rfc/nmd-210
The server startup could "hang" to the view of an operator if it
had a key that could not be decrypted or replicated loaded from
the FSM at startup.
In order to prevent this happening, the server startup function
will now use a timeout to wait for the encrypter to be ready. If
the timeout is reached, the error is sent back to the caller which
fails the CLI command. This bubbling of error message will also
flush to logs which will provide addition operator feedback.
The server only cares about keys loaded from the FSM snapshot and
trailing logs before the encrypter should be classed as ready. So
that the encrypter ready function does not get blocked by keys
added outside of the initial Raft load, we take a snapshot of the
decryption tasks as we enter the blocking call, and class these as
our barrier.
The test for `nomad setup vault` command expects a specific `CreateIndex` for the
job it creates. Any Raft write when a server comes up or establishes leadership
can cause this test to break. Interpolate the expected index as we've done for
other indexes on the job to make this test less brittle.
Ref: https://github.com/hashicorp/nomad-enterprise/pull/2673#issuecomment-2847619747
ResolveToken RPC endpoint was only used by the /acl/token/self API. We should migrate to the WI-aware WhoAmI instead.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
First of all, we should not send the unix time, but the monotonic time.
Second of all, RELOADING= and MONOTONIC_USEC fields should be sent in
*single* message not two separate messages.
From the man page of [systemd.service](https://www.freedesktop.org/software/systemd/man/latest/systemd.service.html#Type=)
> notification message via sd_notify(3) that contains the "RELOADING=1" field in
> combination with "MONOTONIC_USEC=" set to the current monotonic time (i.e.
> CLOCK_MONOTONIC in clock_gettime(2)) in μs, formatted as decimal string.
[sd_notify](https://www.freedesktop.org/software/systemd/man/latest/sd_notify.html)
now has code samples of the protocol to clarify.
Without these changes, if you'd set
Type=notify-reload on the agen'ts systemd unit, systemd
would kill the service due to the service not responding to reload
correctly.
trying not to violate the principle of least astonishment.
we want to only auto-enable PKCE on *new* auth methods,
rather than *new or updated* auth methods, to avoid a
scenario where a Nomad admin updates an auth method
sometime in the future -- something innocent like a new
client secret -- and their OIDC provider doesn't like PKCE.
the main concern is that the provider won't like PKCE
in a totally confusing way. error messages rarely
say PKCE directly, so why the user's auth method
suddenly broke would be a big mystery.
this means that to enable it on existing auth methods,
you would set `OIDCDisablePKCE = false`, and the double-
negative doesn't feel right, so instead, swap the language,
so enabling it on *existing* methods reads sensibly, and to
disable it on *new* methods reads ok-enough:
`OIDCEnablePKCE = false`
The `server.num_scheduler` configuration value should be a value
between 0 and the number of CPUs on the machine. The Nomad agent
was not validating the configuration parameter which meant you
could use a negative value or a value much larger than the
available machine CPUs. This change enforces validation of the
configuration value both on server startup and when the agent is
reloaded.
The Nomad API was only performing negative value validation when
updating the scheduler number via this method. This change adds
to the validation to ensure the number is not greater than the
CPUs on the machine.
The agent retry joiner implementation had different parameters
to control its execution for agents running in server and client
mode. The agent would set up individual joiners depending on the
agent mode, making the object parameter overhead unrequired.
This change removes the excess configuration options for the
joiner, reducing code complexity slighly and hopefully making
future modifications in this area easier to make.
Errors from `volume create` or `volume delete` only get logged by the client
agent, which may make it harder for volume authors to debug these tasks if they
are not also the cluster administrator with access to host logs.
Allow plugins to include an optional error message in their response. Because we
can't count on receiving this response (the error could come before the plugin
executes), we parse this message optimistically and include it only if
available.
Ref: https://hashicorp.atlassian.net/browse/NET-12087
PKCE is enabled by default for new/updated auth methods.
* ref: https://oauth.net/2/pkce/
Client assertions are an optional, more secure replacement for client secrets
* ref: https://oauth.net/private-key-jwt/
a change to the existing flow, even without these new options,
is that the oidc.Req is retained on the Nomad server (leader)
in between auth-url and complete-auth calls.
and some fields in auth method config are now more strictly required.
When Nomad registers a service within Consul it is regarded as a
node service. In order for Nomad workloads to read these services,
it must have an ACL policy which includes node_prefix read. If it
does not, the service is filtered out from the result.
This change adds the required permission to the Consul setup
command.
* Basic implementation for server members and node status
* Commands for alloc status and job status
* -ui flag for most commands
* url hints for variables
* url hints for job dispatch, evals, and deployments
* agent config ui.cli_url_links to disable
* Fix an issue where path prefix was presumed for variables
* driver uncomment and general cleanup
* -ui flag on the generic status endpoint
* Job run command gets namespaces, and no longer gets ui hints for --output flag
* Dispatch command hints get a namespace, and bunch o tests
* Lots of tests depend on specific output, so let's not mess with them
* figured out what flagAddress is all about for testServer, oof
* Parallel outside of test instances
* Browser-opening test, sorta
* Env var for disabling/enabling CLI hints
* Addressing a few PR comments
* CLI docs available flags now all have -ui
* PR comments addressed; switched the env var to be consistent and scrunched monitor-adjacent hints a bit more
* ui.Output -> ui.Warn; moves hints from stdout to stderr
* isTerminal check and parseBool on command option
* terminal.IsTerminal check removed for test-runner-not-being-terminal reasons
The group level fields stop_after_client_disconnect,
max_client_disconnect, and prevent_reschedule_on_lost were deprecated in
Nomad 1.8 and replaced by field in the disconnect block. This change
removes any logic related to those deprecated fields.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>