The new `nomad setup vault -check` command can be used to retrieve
information about the changes required before a cluster is migrated from
the deprecated legacy Vault authentication flow to using only workload
identities.
Some users with batch workloads or short-lived prestart tasks want to derive a
Vault token, use it, and then allow it to expire without requiring a constant
refresh. Add the `vault.allow_token_expiration` field, which works only with the
workload identity workflow, not the legacy workflow.
When set to true, this field disables the client's renewal loop in the
`vault_hook`. Once Vault revokes the token lease, the token is no longer
valid. The client also now automatically detects when the Vault auth
configuration does not allow renewals and disables the renewal loop.
Note this should only be used when a secret is requested from Vault once at the
start of a task or in a short-lived prestart task. Long-running tasks should
never set `allow_token_expiration=true` if they obtain Vault secrets via
`template` blocks, as the Vault token will expire and the template runner will
continue to make failing requests to Vault until the `vault_retry` attempts are
exhausted.
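As a minimal sketch (job shape, image, and secret path are all illustrative), a short-lived prestart task using the new field might look like:

```hcl
job "migrate" {
  group "db" {
    task "schema-migrate" {
      # Runs once before the main tasks and exits.
      lifecycle {
        hook = "prestart"
      }

      vault {
        # Workload identity flow only: the client skips the renewal
        # loop and lets the token expire after the task is done.
        allow_token_expiration = true
      }

      template {
        # The secret is read once at task start, which is the intended
        # use case; a long-running task must not do this.
        data        = <<-EOT
          {{ with secret "secret/data/db" }}DB_PASSWORD={{ .Data.data.password }}{{ end }}
        EOT
        destination = "secrets/db.env"
        env         = true
      }

      driver = "docker"
      config {
        image = "migrate/migrate:v4.17.0"
      }
    }
  }
}
```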
Fixes: https://github.com/hashicorp/nomad/issues/8690
Add support for Consul Enterprise admin partitions. We added fingerprinting in
https://github.com/hashicorp/nomad/pull/19485. This PR adds a `consul.partition`
field. The expectation is that most users will create a mapping of Nomad node
pools to Consul admin partitions, but we also create an implicit constraint for
the fingerprinted value.
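A minimal sketch of where the field would sit in a jobspec (partition name illustrative):

```hcl
job "web" {
  group "web" {
    consul {
      # Services and tasks in this group use the given Consul Enterprise
      # admin partition; the implicit constraint keeps the group on nodes
      # whose fingerprinted consul.partition matches.
      partition = "team-a"
    }

    task "web" {
      driver = "docker"
      config {
        image = "nginx:1.25"
      }
    }
  }
}
```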
Fixes: https://github.com/hashicorp/nomad/issues/13139
Add a new optional `OIDCDisableUserInfo` setting for the OIDC auth provider,
which disables the request to the identity provider's OIDC UserInfo endpoint.
This option is helpful when your identity provider doesn't return any additional
claims from the UserInfo endpoint, such as the Microsoft AD FS OIDC provider:
> The AD FS UserInfo endpoint always returns the subject claim as specified in the
> OpenID standards. AD FS doesn't support additional claims requested via the
> UserInfo endpoint
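For illustration, the setting sits alongside the other OIDC fields in the auth method's config, typically supplied to `nomad acl auth-method create` via its `-config` flag (a sketch; URLs and client values are placeholders):

```json
{
  "OIDCDiscoveryURL": "https://adfs.example.com/adfs",
  "OIDCClientID": "nomad",
  "OIDCClientSecret": "example-secret",
  "BoundAudiences": ["nomad"],
  "AllowedRedirectURIs": ["https://nomad.example.com/ui/settings/tokens"],
  "OIDCDisableUserInfo": true
}
```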
Fixes: #19318
The `defaultVault` variable is a pointer to the Vault configuration
named `default`. Initially, this variable points to the Vault
configuration used to load CLI flag values, but after those are merged
with the default and config file values, the pointer must be updated
before mutating the config with environment variable values.
The `-dev-consul` and `-dev-vault` flags add default identities and
configuration to the Nomad agent so it can connect to Consul and Vault
using the workload identity integration.
When a Connect service is registered with Consul, Nomad includes the nested
`Connect.SidecarService` field that includes health checks for the Envoy
proxy. Because these are not part of the job spec, the alloc health tracker
created by `health_hook` doesn't know to read the value of these checks.
In many circumstances this won't be noticed, but if the Envoy health check
happens to take longer than the `update.min_healthy_time` (perhaps because it's
been set low), it's possible for a deployment to progress too early such that
there will briefly be no healthy instances of the service available in Consul.
Update the Consul service client to find the nested sidecar service in the
service catalog and attach it to the results provided to the tracker. The
tracker can then check the sidecar health checks.
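For reference, the affected shape is a Connect service in a deployment with a short `min_healthy_time` (a minimal sketch; names and values are illustrative):

```hcl
job "api" {
  group "api" {
    update {
      # Low enough that it could elapse before Envoy's Consul health
      # checks (which are not visible in the jobspec) have passed.
      min_healthy_time = "5s"
    }

    network {
      mode = "bridge"
    }

    service {
      name = "api"
      port = "8080"
      connect {
        # Consul nests Envoy's health checks under this sidecar service.
        sidecar_service {}
      }
    }

    task "api" {
      driver = "docker"
      config {
        image = "example/api:1.0"
      }
    }
  }
}
```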
Fixes: https://github.com/hashicorp/nomad/issues/19269
This commit introduces the `preventRescheduleOnLost` parameter, which indicates
that the task group can't afford to have multiple instances running at the same
time. When a node goes down, its allocations are marked as unknown, but no
replacements are rescheduled. If the lost node comes back up, the allocations
reconnect and continue to run.
If `max_client_disconnect` is also enabled and the group has a reschedule
policy, an error is returned.
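Assuming the jobspec surface mirrors the internal parameter name (the `prevent_reschedule_on_lost` field name and values below are assumptions), usage might look like:

```hcl
job "stateful" {
  group "db" {
    # Field name assumed from the preventRescheduleOnLost parameter:
    # allocations on a lost node are marked unknown and no replacements
    # are scheduled; they reconnect and resume if the node returns.
    prevent_reschedule_on_lost = true
    max_client_disconnect      = "1h"

    # Per the validation above, combining this with max_client_disconnect
    # means the group cannot have an active reschedule policy, so
    # rescheduling is disabled explicitly here.
    reschedule {
      attempts  = 0
      unlimited = false
    }

    task "db" {
      driver = "docker"
      config {
        image = "postgres:16"
      }
    }
  }
}
```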
Implements issue #10366
Co-authored-by: Dom Lavery <dom@circleci.com>
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
Update the `nomad setup consul` command to include a `Selector` for the
`NamespaceRule` so the logic is only applied when the token has a claim
for `consul_namespace`.
Jobs without an explicit `consul.namespace` value receive a JWT without
the `consul_namespace` claim because Nomad is unable to determine which
Consul namespace should be used.
By using `NamespaceRules`, cluster operators are able to set a default
value for these jobs.
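The resulting namespace rule in the Consul auth method looks roughly like this (a sketch; the selector follows Consul's namespace rule syntax and the claim name comes from Nomad's workload identity JWT):

```json
{
  "NamespaceRules": [
    {
      "Selector": "\"consul_namespace\" in value",
      "BindNamespace": "${value.consul_namespace}"
    }
  ]
}
```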
* API command and jobspec docs
* PR comments addressed
* API docs for job/jobid/action socket
* Remove a possibly incorrect mention of job_id across the jobs API doc
* PR comments addressed
In order to correctly handle Consul namespaces, auth methods and binding rules
must always be created in the default namespace only.
---------
Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
The `-reschedule` flag stops allocations and assumes the Nomad scheduler
will create new allocations to replace them. But this is only true for
service and batch jobs.
Restarting non-service jobs with the `-reschedule` flag causes the
command to loop forever waiting for the allocations to be replaced,
which never happens.
Allocations for system jobs may be replaced by triggering an evaluation
after each stop to cause the reconciler to run again.
Sysbatch jobs should not be allowed to be rescheduled as they are never
replaced by the scheduler.
This simplifies the default setup of workload identity-based authentication
between Nomad workloads and Consul by using a single auth method with two
binding rules.
Users can still specify separate auth methods for services and tasks.
* Initial pass at a global actions instance queue
* Action card with a bunch of functionality that needs to be pared back a bit
* Happy little actions button
* runAction updated to use the actions service
* Stop All and Clear Finished buttons
* Keyboard service now passes element, so we can pseudo-click the actions dropdown
* resizable sidebar code blocks
* Contextual actions within task and job levels
* runAction greatly consolidated
* Pluralize action text
* Peer grouping of flyout action instances
* ShortIDs instead of full alloc IDs
* Test fixes that previously depended on notifications
* Stop and stop all for peered action instances
* Job name in action instance card linkable
* Componentized actions global button
* scss consolidation
* Clear and Stop buttons become mutually exclusive in an action card
* Clean up action card title styles a bit
* todo-bashing
* stopAll and stopPeers separated and fixed up
* Socket handling functions moved to the Actions service
* Error handling on socket message
* Smarter import
* Documentation note: need alloc-exec and alloc-raw-exec for raw_exec jobs
* Tests for flyout and dropdown actions
* Docs link when in empty flyout/queue state and percy snapshot test for it
The `nomad job restart` command should skip allocations that already
have replacements. Restarting an allocation with a replacement is a
no-op because the allocation status is terminal and the command's
replacement monitor returns immediately.
But by not skipping them, the effective batch size is computed
incorrectly.
The allocrunner has a service registration handler that proxies various API
calls to Consul. With multi-cluster support (for ENT), the service registration
handler is what selects the correct Consul client. The name of this field in the
allocrunner and taskrunner code base looks like it's referring to the actual
Consul API client. This was actually the case before Nomad native service
discovery was implemented, but now the name is misleading.
When creating the binding rule, `BindName` must match the pattern used
for the role name, otherwise the task will not be able to log in to
Consul.
Also update the equality check for the binding rule to ensure this
property is held even if the auth method already has existing binding
rules attached.
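In Consul terms, the invariant is that the binding rule's `BindName` template resolves to the same name used when creating the role (a sketch; the method name and role-name pattern below are placeholders, not the exact values the setup command uses):

```json
{
  "AuthMethod": "nomad-workloads",
  "BindType": "role",
  "BindName": "nomad-${value.nomad_namespace}-tasks"
}
```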
* make the little dots consistent
* don't trim delimiter as that over matches
* test jobspec2 package
* copy api/WorkloadIdentity.TTL -> structs
* test ttl parsing (see the sketch after this list)
* fix hcl1 vs hcl2 parsing mismatch
* make jobspec(1) tests match jobspec2 tests
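For reference, the TTL being parsed surfaces in the jobspec `identity` block (a minimal sketch; names and values are illustrative):

```hcl
job "example" {
  group "g" {
    task "t" {
      driver = "docker"
      config {
        image = "busybox:1.36"
      }

      identity {
        # The TTL whose HCL1/HCL2 parsing the changes above exercise;
        # Nomad renews the identity token before this duration elapses.
        name = "example"
        aud  = ["example.io"]
        ttl  = "1h"
        file = true
      }
    }
  }
}
```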
Remove the now-unused original configuration blocks for Consul and Vault from
the client. When the client needs to refer to a Consul or Vault block it will
always be for a specific cluster for the task/service. Add a helper for
accessing the default clusters (for the client's own use).
This is the second of three changesets for this work. The remainder will implement the
same changes in the `command/agent` package.
As part of this work I discovered and fixed two bugs:
* The gRPC proxy socket that we create for Envoy is only ever created using the
default Consul cluster's configuration. This will prevent Connect from being
used with the non-default cluster.
* The Consul configuration we use for templates always comes from the default
Consul cluster's configuration, but will use the correct Consul token for the
non-default cluster. This will prevent templates from being used with the
non-default cluster.
Ref: https://github.com/hashicorp/nomad/issues/18947
Ref: https://github.com/hashicorp/nomad/pull/18991
Fixes: https://github.com/hashicorp/nomad/issues/18984
Fixes: https://github.com/hashicorp/nomad/issues/18983
Submitting a Consul or Vault token with a job is deprecated in Nomad 1.7 and
intended for removal in Nomad 1.9. We added a deprecation warning to the CLI
when the user passes in the appropriate flag or environment variable, but that
warning also fires when your job does not use Vault or Consul and you merely
happen to have the appropriate environment variable set in your environment.
While this is generally a bad practice (because the token is leaked to Nomad),
it's also the existing practice for some users.
Move the warning to the job admission hook. This will allow us to warn only when
appropriate, and that will also help the migration process by producing warnings
only for the relevant jobs.