Protect against a race where destroying and persist state goroutines
race.
The downside is that the database io operation will run while holding
the lock and may run indefinitely. The risk of lock being long held is
slow destruction, but slow io has bigger problems.
This fixes a bug where allocs that have been GCed get re-run again after client
is restarted. A heavily-used client may launch thousands of allocs on startup
and get killed.
The bug is that an alloc runner that gets destroyed due to GC remains in
client alloc runner set. Periodically, they get persisted until alloc is
gced by server. During that time, the client db will contain the alloc
but not its individual tasks status nor completed state. On client restart,
client assumes that alloc is pending state and re-runs it.
Here, we fix it by ensuring that destroyed alloc runners don't persist any alloc
to the state DB.
This is a short-term fix, as we should consider revamping client state
management. Storing alloc and task information in non-transaction non-atomic
concurrently while alloc runner is running and potentially changing state is a
recipe for bugs.
Fixes https://github.com/hashicorp/nomad/issues/5984
Related to https://github.com/hashicorp/nomad/pull/5890
Fixes a bug where we cpu is pigged at 100% due to collecting devices
statistics. The passed stats interval was ignored, and the default zero
value causes a very tight loop of stats collection.
FWIW, in my testing, it took 2.5-3ms to collect nvidia GPU stats, on a
`g2.2xlarge` ec2 instance.
The stats interval defaults to 1 second and is user configurable. I
believe this is too frequent as a default, and I may advocate for
reducing it to a value closer to 5s or 10s, but keeping it as is for
now.
Fixes https://github.com/hashicorp/nomad/issues/6057 .
* adds meta object to service in job spec, sends it to consul
* adds tests for service meta
* fix tests
* adds docs
* better hashing for service meta, use helper for copying meta when registering service
* tried to be DRY, but looks like it would be more work to use the
helper function
Fixes#6041
Unlike all other Consul operations, boostrapping requires Consul be
available. This PR tries Consul 3 times with a backoff to account for
the group services being asynchronously registered with Consul.
* nomad: add admission controller framework
* nomad: add admission controller framework and Consul Connect hooks
* run admission controllers before checking permissions
* client: add default node meta for connect configurables
* nomad: remove validateJob func since it has been moved to admission controller
* nomad: use new TaskKind type
* client: use consts for connect sidecar image and log level
* Apply suggestions from code review
Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>
* nomad: add job register test with connect sidecar
* Update nomad/job_endpoint_hooks.go
Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>
When rendering a task template, the `plugin` function is no longer
permitted by default and will raise an error. An operator can opt-in
to permitting this function with the new `template.function_blacklist`
field in the client configuration.
When rendering a task template, path parameters for the `file`
function will be treated as relative to the task directory by
default. Relative paths or symlinks that point outside the task
directory will raise an error. An operator can opt-out of this
protection with the new `template.disable_file_sandbox` field in the
client configuration.
When rendering a task consul template, ensure that only task environment
variables are used.
Currently, `consul-template` always falls back to host process
environment variables when key isn't a task env var[1]. Thus, we add
an empty entry for each host process env-var not found in task env-vars.
[1] bfa5d0e133/template/funcs.go (L61-L75)
Adds a new Prerun and Postrun hooks to manage set up of network namespaces
on linux. Work still needs to be done to make the code platform agnostic and
support Docker style network initalization.