When attempting to clone a git repository within a sandbox configured with
landlock, the clone fails with error messages about being unable to get
random bytes for a temporary file. Including a read rule for `/dev/urandom`
resolves the error, and the git clone works as expected.
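A minimal sketch of the kind of rule involved, assuming the `shoenig/go-landlock` style API that Nomad's sandboxing uses; the paths and rule set here are illustrative, not the actual change:

```go
package main

import (
	"os/exec"

	landlock "github.com/shoenig/go-landlock"
)

func main() {
	// Sketch only: confine the process, but allow read access to /dev/urandom
	// so git can gather random bytes for its temporary files.
	l := landlock.New(
		landlock.Dir("/srv/clone-target", "rwc"), // hypothetical sandbox work dir
		landlock.File("/dev/urandom", "r"),       // the read rule that fixes the clone
	)
	if err := l.Lock(landlock.Mandatory); err != nil {
		panic(err)
	}

	// With the rule in place, the clone no longer fails on entropy access.
	_ = exec.Command("git", "clone", "https://example.com/repo.git", "/srv/clone-target").Run()
}
```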
When we refactored the E2E provisioning to allow it to be reused by the upgrade
testing, we didn't thread the `instance_type` variable from the main module down
into the `provision-infra` module. This prevents you from setting a custom
instance size when deploying the E2E cluster manually.
* docs: revert to labels={"foo.bar": "baz"} style
Back in #24074 I thought it was necessary to wrap labels in a list to
support quoted keys in hcl2. This... doesn't appear to be true at all?
The simpler `labels={...}` syntax appears to work just fine.
I updated the docs and a test (and modernized it a bit). I also switched
some other examples to the `labels = {}` format from the old `labels{}`
format.
* copywronged
* fmtd
In #26169 we started emitting structured logs from the reconciler. But the node
reconciler results are `AllocTuple` structs and not counts, so the information
we put in the logs ends up being pointer addresses in hex. Fix this so that
we're recording the number of allocs in each bucket instead.
Fix another misleading log line while we're here.
Ref: https://github.com/hashicorp/nomad/pull/26169
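As a self-contained illustration (simplified types, not the reconciler code itself) of why logging the tuples directly is unreadable and what logging the counts looks like:

```go
package main

import hclog "github.com/hashicorp/go-hclog"

// allocTuple is a simplified stand-in for the reconciler's AllocTuple.
type allocTuple struct{ Name string }

func main() {
	logger := hclog.New(&hclog.LoggerOptions{Name: "reconciler", Level: hclog.Debug})

	place := []*allocTuple{{Name: "web[0]"}, {Name: "web[1]"}}
	update := []*allocTuple{{Name: "web[2]"}}

	// Before: logging the tuple slices renders as hex pointer addresses,
	// e.g. place=[0xc000010030 0xc000010048].
	logger.Debug("reconcile results", "place", place, "update", update)

	// After: record the number of allocs in each bucket instead.
	logger.Debug("reconcile results", "place", len(place), "update", len(update))
}
```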
When the namespace was not found in state, indicated by a nil
object, we were using the name field of the nil object to build the
returned error, which panics.
This code path is not currently triggered, because the call flow
ensures the namespace is always found within state, but this change
makes sure we do not hit the panic in the future.
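A simplified sketch of the pattern (stand-in types, not the actual endpoint code): dereferencing the nil lookup result to build the error is what panics, so the guard uses the requested name instead:

```go
package main

import "fmt"

// Namespace is a simplified stand-in for the object returned by the state store.
type Namespace struct{ Name string }

var state = map[string]*Namespace{"default": {Name: "default"}}

func lookupNamespace(name string) (*Namespace, error) {
	ns := state[name] // nil when the namespace is not found in state

	// Before: building the error from the nil object's field panics.
	//   return nil, fmt.Errorf("namespace %q does not exist", ns.Name)

	// After: report the requested name; never dereference the nil pointer.
	if ns == nil {
		return nil, fmt.Errorf("namespace %q does not exist", name)
	}
	return ns, nil
}

func main() {
	if _, err := lookupNamespace("missing"); err != nil {
		fmt.Println(err)
	}
}
```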
The go-metrics library retains Prometheus metrics in memory until expiration,
but the expiration logic requires that the metrics are being regularly
scraped. If you don't have a Prometheus server scraping, this leads to
ever-increasing memory usage. In particular, high-volume dispatch workloads emit
a large set of label values, and if these are not eventually aged out, the bulk of
Nomad server memory can end up consumed by metrics.
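For illustration, a sketch of the expiration knob in question (not Nomad's actual telemetry wiring; it assumes the `hashicorp/go-metrics` package layout). The expiration check effectively runs at collection time, which is why memory grows when nothing scrapes:

```go
package main

import (
	"time"

	metrics "github.com/hashicorp/go-metrics"
	promsink "github.com/hashicorp/go-metrics/prometheus"
)

func main() {
	// Retain series for one minute; note that expired series are only pruned
	// when the sink is collected, i.e. when something is actually scraping.
	sink, err := promsink.NewPrometheusSinkFrom(promsink.PrometheusOpts{
		Expiration: 60 * time.Second,
	})
	if err != nil {
		panic(err)
	}

	if _, err := metrics.NewGlobal(metrics.DefaultConfig("nomad"), sink); err != nil {
		panic(err)
	}

	// High-cardinality labels (e.g. per-dispatch identifiers) each create a
	// series that lives in the sink until expiration is applied.
	metrics.IncrCounterWithLabels([]string{"example", "dispatch"}, 1,
		[]metrics.Label{{Name: "job_id", Value: "batch/dispatch-1234"}})
}
```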
The Nomad clients store their Nomad identity in memory and within
their state store. While a client is active, it is not possible to dump
the state to view the stored identity token, so having a way to view
the current claims at runtime aids debugging and operations.
This change adds a client identity workflow, allowing operators
to view the current claims of the node's identity. It does not
return any of the signing key material.
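For context on what "claims without key material" means in practice, a stdlib-only sketch that decodes the claims segment of a JWT-style identity token; the token contents and claim names here are made up, and this is not the new workflow itself:

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"strings"
)

func main() {
	// Made-up token for illustration: header.payload.signature. Only the
	// payload (claims) segment is decoded; nothing about the signature or
	// signing keys is needed or exposed.
	token := "fakeheader." +
		base64.RawURLEncoding.EncodeToString([]byte(`{"sub":"node:example","node_id":"abc123"}`)) + // hypothetical claim names
		".fakesignature"

	parts := strings.Split(token, ".")
	payload, err := base64.RawURLEncoding.DecodeString(parts[1])
	if err != nil {
		panic(err)
	}

	var claims map[string]any
	if err := json.Unmarshal(payload, &claims); err != nil {
		panic(err)
	}
	fmt.Println(claims) // map[node_id:abc123 sub:node:example]
}
```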
* e2e: update standalone envoy binary version
fix for:
> === FAIL: e2e/exec2 TestExec2/testCountdash (21.25s)
> exec2_test.go:71:
> ...
> [warning][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:155] DeltaAggregatedResources gRPC config stream to local_agent closed: 3, Envoy 1.29.4 is too old and is not supported by Consul
there's also this warning, but it doesn't seem so fatal:
> [warning][main] [source/server/server.cc:910] There is no configured limit to the number of allowed active downstream connections. Configure a limit in `envoy.resource_monitors.downstream_connections` resource monitor.
picked latest supported from latest consul (1.21.4):
```
$ curl -s localhost:8500/v1/agent/self | jq .xDS.SupportedProxies
{
  "envoy": [
    "1.34.1",
    "1.33.2",
    "1.32.5",
    "1.31.8"
  ]
}
```
* e2e: exec2: remove extraneous bits
* reschedule: no reschedule for batch jobs
* unveil: nomad paths get auto-unveiled with unveil_defaults
https://github.com/hashicorp/nomad-driver-exec2/blob/v0.1.0/plugin/driver.go#L514-L522
For a while now, we've had only two implementations of the Planner interface in
Nomad: one was the Worker, and the other was the scheduler test harness, which
was then used as an argument to the scheduler constructors in the FSM and the job
endpoint RPC. That's not great, and one of the recent refactors made it apparent
that we're importing testing code in places we really shouldn't. We finally got
called out for it, and this PR attempts to remedy the situation by splitting the
Harness into Plan (which contains the actual plan submission logic) and separating
it from the testing code.
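A structural sketch of what such a split looks like (package and type names here are illustrative, not the ones in the PR): the plan-submission logic lives in a production type, and the test harness layers on top of it instead of production code importing the harness:

```go
// Illustrative layout only; names do not match the real Nomad packages or types.
package scheduler

// Planner is the interface schedulers use to submit plans (simplified).
type Planner interface {
	SubmitPlan(*Plan) (*PlanResult, error)
}

// Plan and PlanResult are stubs standing in for the real plan types.
type (
	Plan       struct{ JobID string }
	PlanResult struct{ Accepted bool }
)

// planApplier carries the actual plan-submission logic, so the FSM and the job
// endpoint RPC can construct a Planner without importing any test helpers.
type planApplier struct{}

func (p *planApplier) SubmitPlan(plan *Plan) (*PlanResult, error) {
	// ... apply the plan against server state ...
	return &PlanResult{Accepted: true}, nil
}

// A test harness can then embed planApplier and layer assertions on top,
// keeping testing code out of the production call path.
```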
* fix panic from nil ReschedulePolicy
commit 279775082c (pr #26279)
intended to return an error for sysbatch jobs with a reschedule block,
but because it skipped populating the `ReschedulePolicy`'s pointer fields,
a nil pointer panic occurred before the job could be rejected
with the intended error.
in particular, in `command/agent/job_endpoint.go`, `func ApiTgToStructsTG`,
```go
if taskGroup.ReschedulePolicy != nil {
    tg.ReschedulePolicy = &structs.ReschedulePolicy{
        Attempts: *taskGroup.ReschedulePolicy.Attempts,
        Interval: *taskGroup.ReschedulePolicy.Interval,
        // ...
```
`*taskGroup.ReschedulePolicy.Interval` was a nil pointer.
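A self-contained reproduction of the failure mode with simplified stand-in types (not the actual conversion code, and not necessarily how the PR fixes it): the unconditional dereference panics when the pointer fields were never populated, while guarding each field avoids it:

```go
package main

import (
	"fmt"
	"time"
)

// Simplified stand-ins for the api and structs ReschedulePolicy types.
type apiReschedulePolicy struct {
	Attempts *int
	Interval *time.Duration
}

type reschedulePolicy struct {
	Attempts int
	Interval time.Duration
}

func convert(in *apiReschedulePolicy) *reschedulePolicy {
	if in == nil {
		return nil
	}

	// Before: unconditional dereferences, which panic when the pointer fields
	// were never populated (as for the sysbatch case being rejected).
	//   return &reschedulePolicy{Attempts: *in.Attempts, Interval: *in.Interval}

	// One way to avoid the panic: guard each pointer field before dereferencing.
	out := &reschedulePolicy{}
	if in.Attempts != nil {
		out.Attempts = *in.Attempts
	}
	if in.Interval != nil {
		out.Interval = *in.Interval
	}
	return out
}

func main() {
	fmt.Printf("%+v\n", convert(&apiReschedulePolicy{})) // &{Attempts:0 Interval:0s}, no panic
}
```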
* fix e2e test jobs
In #8099 we fixed a bug where garbage collecting a job with
`disconnect.stop_on_client_after` would spawn recursive delayed evals. But when
applied to disconnected allocs with `replace=true`, the fix prevents us from
emitting a blocked eval if there's no room for the replacement.
Update the guard on creating blocked evals so that, rather than checking for
`IsZero`, we check whether we are already past the `WaitUntil` time. This
separates the guard from the logic guarding the creation of delayed evals, so
that we can potentially create both when needed.
Ref: https://github.com/hashicorp/nomad/pull/8099/files#r435198418
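A sketch of the changed guard condition with a simplified signature (illustrative only; the real check lives in the scheduler):

```go
package main

import (
	"fmt"
	"time"
)

// shouldBlock sketches the updated guard: instead of only creating a blocked
// eval when WaitUntil is unset (IsZero), create one whenever we are already
// past WaitUntil. A zero WaitUntil is always in the past, so the old case is
// still covered, and a future WaitUntil is left to the delayed-eval logic.
func shouldBlock(waitUntil, now time.Time) bool {
	// Old guard:
	//   return waitUntil.IsZero()
	return now.After(waitUntil)
}

func main() {
	now := time.Now()
	fmt.Println(shouldBlock(time.Time{}, now))           // true: no delay was set
	fmt.Println(shouldBlock(now.Add(-time.Minute), now)) // true: the delay has already passed
	fmt.Println(shouldBlock(now.Add(time.Minute), now))  // false: the delayed eval covers it
}
```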
* fix(doc): fix links for task driver plugins
host URL was wrong, changed from `develoepr` to `developer`
* Update stateful-workloads.mdx
Fix link for Nomad event stream page