When a node is garbage collected, any dynamic host volumes on the node are
orphaned in the state store. We generally don't want to collect these volumes
automatically and risk data loss, so #25902 provided a `-force` CLI flag to
remove them. But for clusters running on ephemeral cloud instances (e.g. AWS
EC2 in an autoscaling group), manually deleting host volumes adds excessive
friction. Add a knob to the client configuration to remove host volumes from
the state store on node GC.
Ref: https://github.com/hashicorp/nomad/pull/25902
Ref: https://github.com/hashicorp/nomad/issues/25762
Ref: https://hashicorp.atlassian.net/browse/NMD-705
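A minimal sketch of what the new client knob might look like; the attribute
name here is hypothetical, since this note doesn't name it:
client {
  # Hypothetical attribute: when true, host volumes registered on a node are
  # removed from the state store when the node itself is garbage collected.
  gc_volumes_on_node_gc = true
}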
* Set MaxAllocations in client config
Add NodeAllocationTracker struct to Node struct
Evaluate MaxAllocations in AllocsFit function
Set up CLI config parsing
Integrate maxAllocs into the AllocatedResources view (see the config sketch below)
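A sketch of how this could surface in the client configuration; the attribute
name is illustrative rather than the final one:
client {
  # Illustrative name for the MaxAllocations knob: caps how many allocations
  # the scheduler may place on this node; evaluated in AllocsFit.
  max_allocations = 50
}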
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Server startup could appear to "hang" from an operator's point of view if a
key loaded from the FSM at startup could not be decrypted or replicated.
To prevent this, the server startup function now uses a timeout while waiting
for the encrypter to become ready. If the timeout is reached, the error is
returned to the caller, which fails the CLI command. Bubbling up the error
also flushes it to the logs, providing additional operator feedback.
The server only cares that keys loaded from the FSM snapshot and trailing
logs are handled before the encrypter is classed as ready. So that the
encrypter's ready function is not blocked by keys added outside of the
initial Raft load, we take a snapshot of the decryption tasks as we enter the
blocking call and use these as our barrier.
The `server.num_schedulers` configuration value should be between 0 and the
number of CPUs on the machine. The Nomad agent was not validating this
parameter, which meant you could set a negative value or a value much larger
than the number of available CPUs. This change enforces validation of the
configuration value both on server startup and when the agent is reloaded.
The Nomad API was only performing negative-value validation when updating
the scheduler count via this method. This change extends the validation to
ensure the number is not greater than the number of CPUs on the machine.
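For reference, the value being validated lives in the agent's server block;
the number shown here is illustrative:
server {
  enabled = true
  # Must be between 0 and the number of CPUs on the machine; values outside
  # that range are now rejected at startup and on agent reload.
  num_schedulers = 4
}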
* Upgrade to using hashicorp/go-metrics@v0.5.4
This also requires bumping the dependencies for:
* memberlist
* serf
* raft
* raft-boltdb
* (and indirectly hashicorp/mdns due to the memberlist or serf update)
Unlike some other HashiCorp products, Nomad's root module is currently expected to be consumed by others. This means it needs to be treated more like our libraries and upgraded to hashicorp/go-metrics via its compat packages. This allows those importing the root module to control which metrics module is used via build tags.
The Nomad client can now optionally emit telemetry data from the
prerun and prestart hooks. This allows operators to monitor and
alert on failures and time taken to complete.
The new datapoints are:
- nomad.client.alloc_hook.prerun.success (counter)
- nomad.client.alloc_hook.prerun.failed (counter)
- nomad.client.alloc_hook.prerun.elapsed (sample)
- nomad.client.task_hook.prestart.success (counter)
- nomad.client.task_hook.prestart.failed (counter)
- nomad.client.task_hook.prestart.elapsed (sample)
The hook execution time is useful to Nomad engineering and will help us
optimize code where possible and understand how job specifications impact
hook performance.
Currently only the PreRun and PreStart hooks have telemetry enabled, which
limits the number of new metrics being produced.
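A minimal telemetry block for surfacing these datapoints, assuming the new
hook metrics are published alongside the existing client metrics with no
extra opt-in flag:
telemetry {
  # Expose collected metrics, including the new alloc_hook and task_hook
  # datapoints, via the agent metrics API in Prometheus format.
  prometheus_metrics  = true
  collection_interval = "1s"
}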
In Nomad 1.4.0, we shipped support for encrypted Variables and signed Workload
Identities, but the key material is protected only by an AEAD encrypting the
KEK. Add support for Vault transit encryption and external KMS from major cloud
providers. The servers call out to the external service to decrypt each key in
the on-disk keystore.
Ref: https://hashicorp.atlassian.net/browse/NET-10334
Fixes: https://github.com/hashicorp/nomad/issues/14852
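A sketch of what an external KMS keyring configuration could look like in the
server agent config; the block label and attribute names are assumptions and
may differ from what ships:
keyring "awskms" {
  # Assumed attributes: mark this wrapper active and point it at the cloud
  # KMS key used to decrypt each key in the on-disk keystore.
  active     = true
  region     = "us-east-1"
  kms_key_id = "alias/nomad-keyring"
}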
This is the CE side of an Enterprise-only feature;
a job trying to use this in CE will fail to validate.
To enable daily-scheduled execution entirely client-side,
a job may now contain:
task "name" {
schedule {
cron {
start = "0 12 * * * *" # may not include "," or "/"
end = "0 16" # partial cron, with only {minute} {hour}
timezone = "EST" # anything in your tzdb
}
}
...
Everything about the allocation will be placed as usual, but if it is outside
the specified schedule, the taskrunner will block on the client, waiting for
the schedule start, before proceeding with the task driver execution, etc.
This includes a taskrunner hook, which watches for the end of the schedule,
at which point it will kill the task.
Then, restarts allowing, a new task will start and again block waiting for
the start, and so on.
This also includes all the plumbing required to pipe API calls through from
command->api->agent->server->client, so that tasks can be force-run,
force-paused, or resume the schedule on demand.
When loading the client configuration, the user-specified `client.template`
block was not properly merged with the default values. As a result, if the user
set any `client.template` field, all the other fields defaulted to their zero
values instead of the documented defaults.
This changeset:
* Adds the missing `Merge` method for the client template config and ensures
it's called.
* Makes a single source of truth for the default template configuration,
instead of two different constructors.
* Extends the tests to cover the merge of a partial block better.
Fixes: https://github.com/hashicorp/nomad/issues/20164
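As an example of the bug, a partial block like the one below previously
zeroed every unset field instead of leaving it at its documented default;
after the merge fix the other fields keep their defaults:
client {
  template {
    # Only this field is overridden; max_stale, the wait bounds, and the
    # function denylist now retain their documented defaults after merging.
    disable_file_sandbox = true
  }
}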
* exec2: add client support for unveil filesystem isolation mode
This PR adds support for a new filesystem isolation mode, "Unveil". The mode
introduces an "alloc_mounts" directory where each task gets a user-owned
directory structure that is bind-mounted into the real alloc directory
structure. This enables a task driver to use landlock (and maybe the real
unveil on OpenBSD one day) to isolate a task to the task-owned directory
structure, providing sandboxing.
* actually create alloc-mounts-dir directory
* fix doc strings about alloc mount dir paths
* exec: add a client.users configuration block
For now this just adds min/max dynamic user values; soon we can also absorb
the "user.denylist" and "user.checked_drivers" options from the deprecated
client.options map (see the config sketch after this list).
* give the no-op pool implementation a better name
* use explicit error types to make referencing them cleaner in tests
* use import alias to not shadow package name
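A sketch of the new block, assuming the min/max attributes are named roughly
as below:
client {
  users {
    # Assumed attribute names: the pool of UIDs handed out to tasks that
    # request a dynamic workload user.
    dynamic_user_min = 80000
    dynamic_user_max = 89999
  }
}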
Remove the now-unused original configuration blocks for Consul and Vault from
the client. When the client needs to refer to a Consul or Vault block it will
always be for a specific cluster for the task/service. Add a helper for
accessing the default clusters (for the client's own use).
This is the second of three changesets for this work. The remainder will implement the
same changes in the `command/agent` package.
As part of this work I discovered and fixed two bugs:
* The gRPC proxy socket that we create for Envoy is only ever created using the
default Consul cluster's configuration. This will prevent Connect from being
used with the non-default cluster.
* The Consul configuration we use for templates always comes from the default
Consul cluster's configuration, but will use the correct Consul token for the
non-default cluster. This will prevent templates from being used with the
non-default cluster.
Ref: https://github.com/hashicorp/nomad/issues/18947
Ref: https://github.com/hashicorp/nomad/pull/18991
Fixes: https://github.com/hashicorp/nomad/issues/18984
Fixes: https://github.com/hashicorp/nomad/issues/18983
Remove the now-unused original configuration blocks for Consul and Vault from
the server. When the server needs to refer to a Consul or Vault block it will
always be for a specific cluster for the task/service. Add a helper for
accessing the default clusters (for the server's own use).
This is one of three changesets for this work. The remainder will implement the
same changes in the `client` and `command/agent` packages.
As part of this work I discovered that the job submission hook for Vault only
checks the enabled flag on the default cluster, rather than the clusters that
are used by the job being submitted. This will return an error on job
registration saying that Vault is disabled. Fix that to check only the
cluster(s) used by the job.
Ref: https://github.com/hashicorp/nomad/issues/18947
Fixes: https://github.com/hashicorp/nomad/issues/18990
Added the [OIDC Discovery](https://openid.net/specs/openid-connect-discovery-1_0.html) `/.well-known/openid-configuration` endpoint to Nomad, but it is only enabled if the `server.oidc_issuer` parameter is set. Documented the parameter, but without a tutorial, trying to actually _use_ this will be very hard.
I intentionally did *not* use https://github.com/hashicorp/cap for the OIDC configuration struct because it's built to be a *compliant* OIDC provider. Nomad is *not* trying to be compliant initially because compliance with the spec does not guarantee it will actually satisfy the requirements of third parties. I want to avoid the problem where, in an attempt to be standards-compliant, we ship configuration parameters that lock us into a certain behavior that we end up regretting. I want to add parameters and behaviors as there's a demonstrable need.
Users always have the escape hatch of providing their own OIDC configuration endpoint. Nomad just needs to know the Issuer so that the JWTs match the OIDC configuration. There's no reason the actual OIDC configuration JSON couldn't live in S3 and get served directly from there. Unlike JWKS the OIDC configuration should be static, or at least change very rarely.
This PR is just the endpoint extracted from #18535. The `RS256` algorithm still needs to be added in hopes of supporting third parties such as [AWS IAM OIDC Provider](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_providers_create_oidc.html).
Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
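Enabling the endpoint is just a matter of setting the issuer on the servers;
the URL below is a placeholder:
server {
  enabled = true
  # The externally reachable URL reported as the issuer of workload identity
  # JWTs; setting it also enables /.well-known/openid-configuration.
  oidc_issuer = "https://nomad.example.com"
}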
When agents start, they create a shared Consul client that is then wrapped as
various interfaces for testability, and used in constructing the Nomad client
and server. The interfaces that support workload services (rather than the Nomad
agent itself) need to support multiple Consul clusters for Nomad
Enterprise. Update these interfaces to be factory functions that return the
Consul client for a given cluster name. Update the `ServiceClient` to split
workload updates between clusters by creating a wrapper around all the clients
that delegates to the cluster-specific `ServiceClient`.
Ref: https://github.com/hashicorp/team-nomad/issues/404
* client: refactor cpuset partitioning
This PR updates the way the Nomad client manages the split between tasks
that make use of resources.cpus vs. resources.cores.
Previously, each task was explicitly assigned the CPU cores it was able to
run on. Every time a task was started or destroyed, all other tasks' cpusets
would need to be updated. This was inefficient and would crush the Linux
kernel when a client tried to run ~400 or so tasks.
Now, we make use of the cgroup hierarchy and cpuset inheritance to
efficiently manage cpusets.
* cr: tweaks for feedback
Add the plumbing we need to accept multiple Consul clusters in Nomad agent
configuration, to support upcoming Nomad Enterprise features. The `consul` blocks
are differentiated by a new `name` field, and if the `name` is omitted it
becomes the "default" Consul configuration. All blocks with the same name are
merged together, as with the existing behavior.
As with the `vault` block, we're still using HCL1 for parsing configuration and
the `Decode` method doesn't parse multiple blocks differentiated only by a field
name without a label. So we've had to add an extra parsing pass, similar to what
we've done for HCL1 jobspecs. This also revealed a subtle bug in the `vault`
block handling of extra keys when there are multiple `vault` blocks, which I've
fixed here.
For now, all existing consumers will use the "default" Consul configuration, so
there's no user-facing behavior change in this changeset other than the contents
of the agent self API.
Ref: https://github.com/hashicorp/team-nomad/issues/404
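For example, an agent configuration might now carry a default cluster plus a
named one (addresses are placeholders):
consul {
  # No name field, so this block becomes the "default" Consul configuration.
  address = "127.0.0.1:8500"
}
consul {
  # A second, named cluster that Enterprise workloads can reference.
  name    = "infra"
  address = "10.0.0.10:8500"
}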
Add the plumbing we need to accept multiple Vault clusters in Nomad agent
configuration, to support upcoming Nomad Enterprise features. The `vault` blocks
are differentiated by a new `name` field, and if the `name` is omitted it
becomes the "default" Vault configuration. All blocks with the same name are
merged together, as with the existing behavior.
Unfortunately we're still using HCL1 for parsing configuration and the `Decode`
method doesn't parse multiple blocks differentiated only by a field name without
a label. So we've had to add an extra parsing pass, similar to what we've done
for HCL1 jobspecs.
For now, all existing consumers will use the "default" Vault configuration, so
there's no user-facing behavior change in this changeset other than the contents
of the agent self API.
Ref: https://github.com/hashicorp/team-nomad/issues/404
Adds a new configuration to clients to optionally allow them to drain their
workloads on shutdown. The client sends the `Node.UpdateDrain` RPC targeting
itself and then monitors the drain state as seen by the server until the drain
is complete or the deadline expires. If it loses connection with the server, it
will monitor local client status instead to ensure allocations are stopped
before exiting.
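A sketch of the client configuration, assuming attribute names along these
lines:
client {
  drain_on_shutdown {
    # Assumed attributes: how long to wait for the self-drain to complete
    # before the client gives up and exits anyway.
    deadline           = "1h"
    force              = false
    ignore_system_jobs = false
  }
}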
* api: enable support for setting original source alongside job
This PR adds support for setting job source material along with
the registration of a job.
This includes a new HTTP endpoint and a new RPC endpoint for
making queries for the original source of a job. The
HTTP endpoint is /v1/job/<id>/submission?version=<version> and
the RPC method is Job.GetJobSubmission.
The job source (if submitted, and doing so is always optional) is stored in
the job_submission memdb table, separately from the actual job. This way we
do not incur the overhead of reading the large string field throughout normal
job operations.
The server config now includes job_max_source_size for configuring
the maximum size the job source may be, before the server simply
drops the source material. This should help prevent Bad Things from
happening when huge jobs are submitted. If the value is set to 0,
all job source material will be dropped.
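For example (the size shown is illustrative):
server {
  enabled = true
  # Submissions whose source exceeds this size have the source material
  # dropped; setting the value to 0 drops all job source material.
  job_max_source_size = "1M"
}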
* api: avoid writing var content to disk for parsing
* api: move submission validation into RPC layer
* api: return an error if updating a job submission without namespace or job id
* api: be exact about the job index we associate a submission with (modify)
* api: reword api docs scheduling
* api: prune all but the last 6 job submissions
* api: protect against nil job submission in job validation
* api: set max job source size in test server
* api: fixups from pr
* build: add BuildDate to version info
will be used in enterprise to compare to license expiration time
* cli: multi-line version output, add BuildDate
before:
$ nomad version
Nomad v1.4.3 (coolfakecommithashomgoshsuchacoolonewoww)
after:
$ nomad version
Nomad v1.5.0-dev
BuildDate 2023-02-17T19:29:26Z
Revision coolfakecommithashomgoshsuchacoolonewoww
compare consul:
$ consul version
Consul v1.14.4
Revision dae670fe
Build Date 2023-01-26T15:47:10Z
Protocol 2 spoken by default, blah blah blah...
and vault:
$ vault version
Vault v1.12.3 (209b3dd99fe8ca320340d08c70cff5f620261f9b), built 2023-02-02T09:07:27Z
* docs: update version command output
This change introduces the Task API: a portable way for tasks to access Nomad's HTTP API. This particular implementation uses a Unix Domain Socket and, unlike the agent's HTTP API, always requires authentication even if ACLs are disabled.
This PR contains the core feature and tests but followup work is required for the following TODO items:
- Docs - might do in a followup since dynamic node metadata / task api / workload id all need to interlink
- Unit tests for auth middleware
- Caching for auth middleware
- Rate limiting on negative lookups for auth middleware
Co-authored-by: Seth Hoenig <shoenig@duck.com>