Commit Graph

3853 Commits

Author SHA1 Message Date
Mahmood Ali
b2ef75e10d Merge pull request #6216 from hashicorp/b-recognize-pending-allocs
alloc_runner: wait when starting suspicious allocs
2019-08-28 14:46:09 -04:00
Mahmood Ali
8b05f87140 rename to hasLocalState, and ignore clientstate
The ClientState being pending isn't a good criteria; as an alloc may
have been updated in-place before it was completed.

Also, updated the logic so we only check for task states.  If an alloc
has deployment state but no persisted tasks at all, restore will still
fail.
2019-08-28 11:44:48 -04:00
Nick Ethier
99742f2665 ar: ensure network forwarding is allowed for bridged allocs (#6196)
* ar: ensure network forwarding is allowed in iptables for bridged allocs

* ensure filter rule exists at setup time
2019-08-28 10:51:34 -04:00
Nick Ethier
51750f5732 Add environment variables for connect upstreams (#6171)
* taskenv: add connect upstream env vars + test

* set taskenv upstreams instead of appending

* Update client/taskenv/env.go

Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>
2019-08-27 23:41:38 -04:00
Mahmood Ali
493945a8a4 Alternative approach: avoid restoring
This uses an alternative approach where we avoid restoring the alloc
runner in the first place, if we suspect that the alloc may have been
completed already.
2019-08-27 17:30:55 -04:00
Jasmine Dahilig
80dfa33223 expose nomad namespace as environment variable in allocation #5692 (#6192) 2019-08-27 08:38:07 -07:00
Mahmood Ali
cbc521e1e7 alloc_runner: wait when starting suspicious allocs
This commit aims to help users running with clients suseptible to the
destroyed alloc being restrarted bug upgrade to latest.  Without this,
such users will have their tasks run unexpectedly on upgrade and only
see the bug resolved after subsequent restart.

If, on restore, the client sees a pending alloc without any other
persisted info, then err on the side that it's an corrupt persisted
state of an alloc instead of the client happening to be killed right
when alloc is assigned to client.

Few reasons motivate this behavior:

Statistically speaking, corruption being the cause is more likely.  A
long running client will have higher chance of having allocs persisted
incorrectly with pending state.  Being killed right when an alloc is
about to start is relatively unlikely.

Also, delaying starting an alloc that hasn't started (by hopefully
seconds) is not as severe as launching too many allocs that may bring
client down.

More importantly, this helps customers upgrade their clients without
risking taking their clients down and destablizing their cluster. We
don't want existing users to force triggering the bug while they upgrade
and restart cluster.
2019-08-26 22:05:31 -04:00
Mahmood Ali
f61637026e Merge pull request #6207 from hashicorp/b-gc-destroyed-allocs-rerun
Don't persist allocs of destroyed alloc runners
2019-08-26 17:26:18 -04:00
Mahmood Ali
ff3dedd534 Write to client store while holding lock
Protect against a race where destroying and persist state goroutines
race.

The downside is that the database io operation will run while holding
the lock and may run indefinitely.  The risk of lock being long held is
slow destruction, but slow io has bigger problems.
2019-08-26 13:45:58 -04:00
Mahmood Ali
7c1fe3eae5 logmon: log stat error to help debugging 2019-08-26 10:10:20 -04:00
Mahmood Ali
a80643e46d Don't persist allocs of destroyed alloc runners
This fixes a bug where allocs that have been GCed get re-run again after client
is restarted.  A heavily-used client may launch thousands of allocs on startup
and get killed.

The bug is that an alloc runner that gets destroyed due to GC remains in
client alloc runner set.  Periodically, they get persisted until alloc is
gced by server.  During that  time, the client db will contain the alloc
but not its individual tasks status nor completed state.  On client restart,
client assumes that alloc is pending state and re-runs it.

Here, we fix it by ensuring that destroyed alloc runners don't persist any alloc
to the state DB.

This is a short-term fix, as we should consider revamping client state
management.  Storing alloc and task information in non-transaction non-atomic
concurrently while alloc runner is running and potentially changing state is a
recipe for bugs.

Fixes https://github.com/hashicorp/nomad/issues/5984
Related to https://github.com/hashicorp/nomad/pull/5890
2019-08-25 11:21:28 -04:00
Mahmood Ali
cc3da4a441 logmon: revert workaround for Windows go1.11 bug
Revert e0126123ab now that we are running
with Golang 1.12, and https://github.com/golang/go/issues/29119 is no
longer relevant.
2019-08-24 08:19:44 -04:00
Mahmood Ali
b4a80a7eea Merge pull request #6201 from hashicorp/b-device-stats-interval
initialize device manager stats interval
2019-08-24 08:16:03 -04:00
Mahmood Ali
e87d9cc8a6 Merge pull request #6146 from hashicorp/b-config-template-copy
clientConfig.Copy() to copy template config too
2019-08-23 19:00:57 -04:00
Mahmood Ali
e8ebde4ca2 clientConfig.Copy() to copy template config too 2019-08-23 18:43:22 -04:00
Lang Martin
5fc06cd65f taskrunner getter set Umask for go-getter, setuid test 2019-08-23 15:59:03 -04:00
Mahmood Ali
01983ae59b initialize device manager stats interval
Fixes a bug where we cpu is pigged at 100% due to collecting devices
statistics.  The passed stats interval was ignored, and the default zero
value causes a very tight loop of stats collection.

FWIW, in my testing, it took 2.5-3ms to collect nvidia GPU stats, on a
`g2.2xlarge` ec2 instance.

The stats interval defaults to 1 second and is user configurable.  I
believe this is too frequent as a default, and I may advocate for
reducing it to a value closer to 5s or 10s, but keeping it as is for
now.

Fixes https://github.com/hashicorp/nomad/issues/6057 .
2019-08-23 14:58:34 -04:00
Jerome Gravel-Niquet
25e38c8257 Consul service meta (#6193)
* adds meta object to service in job spec, sends it to consul

* adds tests for service meta

* fix tests

* adds docs

* better hashing for service meta, use helper for copying meta when registering service

* tried to be DRY, but looks like it would be more work to use the
helper function
2019-08-23 12:49:02 -04:00
Nick Ethier
974ff0392c ar: fix bridge networking port mapping when port.To is unset (#6190) 2019-08-22 21:53:52 -04:00
Michael Schurter
43d89f864e connect: task hook for bootstrapping envoy sidecar
Fixes #6041

Unlike all other Consul operations, boostrapping requires Consul be
available. This PR tries Consul 3 times with a backoff to account for
the group services being asynchronously registered with Consul.
2019-08-22 08:15:32 -07:00
Michael Schurter
eeacb87f3b connect: register group services with Consul
Fixes #6042

Add new task group service hook for registering group services like
Connect-enabled services.

Does not yet support checks.
2019-08-20 12:25:10 -07:00
Nick Ethier
aaba483787 Builtin Admission Controller Framework (#6116)
* nomad: add admission controller framework

* nomad: add admission controller framework and Consul Connect hooks

* run admission controllers before checking permissions

* client: add default node meta for connect configurables

* nomad: remove validateJob func since it has been moved to admission controller

* nomad: use new TaskKind type

* client: use consts for connect sidecar image and log level

* Apply suggestions from code review

Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>

* nomad: add job register test with connect sidecar

* Update nomad/job_endpoint_hooks.go

Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>
2019-08-15 11:22:37 -04:00
Tim Gross
ffb83e1ef1 client/template: configuration for function blacklist and sandboxing
When rendering a task template, the `plugin` function is no longer
permitted by default and will raise an error. An operator can opt-in
to permitting this function with the new `template.function_blacklist`
field in the client configuration.

When rendering a task template, path parameters for the `file`
function will be treated as relative to the task directory by
default. Relative paths or symlinks that point outside the task
directory will raise an error. An operator can opt-out of this
protection with the new `template.disable_file_sandbox` field in the
client configuration.
2019-08-12 16:34:48 -04:00
Danielle Lancashire
c486143ced Copy documentation to api/tasks 2019-08-12 16:22:27 +02:00
Danielle Lancashire
7b7be83aef HostVolumeConfig: Source -> Path 2019-08-12 15:39:08 +02:00
Danielle Lancashire
af5d42c058 structs: Unify Volume and VolumeRequest 2019-08-12 15:39:08 +02:00
Danielle Lancashire
c3c003dbd6 client: Add volume_hook for mounting volumes 2019-08-12 15:39:08 +02:00
Danielle Lancashire
86b4296f9d client: Add parsing and registration of HostVolume configuration 2019-08-12 15:39:08 +02:00
Nick Ethier
144fb1bfee Revert "client: add autofetch for CNI plugins"
This reverts commit 0bd157cc3b.
2019-08-08 15:10:19 -04:00
Nick Ethier
4b814be995 Revert "client: remove debugging lines"
This reverts commit 54ce4d1f7e.
2019-08-08 14:52:52 -04:00
Mahmood Ali
b6ae5e9d62 Render consul templates using task env only (#6055)
When rendering a task consul template, ensure that only task environment
variables are used.

Currently, `consul-template` always falls back to host process
environment variables when key isn't a task env var[1].  Thus, we add
an empty entry for each host process env-var not found in task env-vars.

[1] bfa5d0e133/template/funcs.go (L61-L75)
2019-08-05 16:30:47 -04:00
Mahmood Ali
2e0d67cbbb Merge pull request #6065 from hashicorp/b-nil-driver-exec
Check if driver handle is nil before execing
2019-08-02 09:48:28 -05:00
Mahmood Ali
488cd7e24e Check if driver handle is nil before execing
Defend against tr.getDriverHandle being nil.  Exec handler checks if
task is running, but it may be stopped between check and driver handler
fetching.
2019-08-02 10:07:41 +08:00
Nick Ethier
0b8fc5d018 client/cni: updated comments and simplified logic to auto download plugins 2019-07-31 01:04:10 -04:00
Nick Ethier
1072084ff3 Apply suggestions from code review
Co-Authored-By: Mahmood Ali <mahmood@hashicorp.com>
2019-07-31 01:04:10 -04:00
Nick Ethier
54ce4d1f7e client: remove debugging lines 2019-07-31 01:04:09 -04:00
Nick Ethier
0bd157cc3b client: add autofetch for CNI plugins 2019-07-31 01:04:09 -04:00
Nick Ethier
4dd5cd103d remove unused file 2019-07-31 01:04:09 -04:00
Nick Ethier
e910fdbb32 fix failing tests 2019-07-31 01:04:07 -04:00
Nick Ethier
dc08ec8783 ar: plumb client config for networking into the network hook 2019-07-31 01:04:06 -04:00
Nick Ethier
e15005bdcb networking: Add new bridge networking mode implementation 2019-07-31 01:04:06 -04:00
Michael Schurter
eb2a2cd76e connect: add group.service stanza support 2019-07-31 01:04:05 -04:00
Nick Ethier
9fa47daf5c ar: fix lint errors 2019-07-31 01:03:19 -04:00
Nick Ethier
56d5fe704a ar: rearrange network hook to support building on windows 2019-07-31 01:03:19 -04:00
Nick Ethier
508ff3cf2f ar: fix test that failed due to error renaming 2019-07-31 01:03:19 -04:00
Nick Ethier
da3978b377 plugins/driver: make DriverNetworkManager interface optional 2019-07-31 01:03:19 -04:00
Nick Ethier
35de444e9b ar: plumb error handling into alloc runner hook initialization 2019-07-31 01:03:18 -04:00
Nick Ethier
16343ae23a ar: add tests for network hook 2019-07-31 01:03:18 -04:00
Nick Ethier
c742f8b580 ar: cleanup lint errors 2019-07-31 01:03:18 -04:00
Nick Ethier
c39e8dca6e ar: move linux specific code to it's own file and add tests 2019-07-31 01:03:18 -04:00