15822 Commits

Author SHA1 Message Date
Michael Schurter
d4553b7569 Merge pull request #6218 from hashicorp/f-consul-defaults
consul: use Consul's defaults and env vars
2019-08-28 11:54:44 -07:00
Mahmood Ali
b2ef75e10d Merge pull request #6216 from hashicorp/b-recognize-pending-allocs
alloc_runner: wait when starting suspicious allocs
2019-08-28 14:46:09 -04:00
Mahmood Ali
8b05f87140 rename to hasLocalState, and ignore clientstate
The ClientState being pending isn't a good criterion, as an alloc may
have been updated in-place before it was completed.

Also, updated the logic so we only check for task states.  If an alloc
has deployment state but no persisted tasks at all, restore will still
fail.
2019-08-28 11:44:48 -04:00
Mahmood Ali
ddf2f6be4d Merge pull request #6219 from hashicorp/c-circleci-upgrade-machine-img
upgrade machine image for most jobs
2019-08-28 11:27:04 -04:00
Lang Martin
3f0f3a06c0 Merge pull request #6215 from hashicorp/f-upgrade-go-getter
upgrade go-getter, leave compiled protobuf at version 1.2
2019-08-28 11:01:31 -04:00
Nick Ethier
99742f2665 ar: ensure network forwarding is allowed for bridged allocs (#6196)
* ar: ensure network forwarding is allowed in iptables for bridged allocs

* ensure filter rule exists at setup time
2019-08-28 10:51:34 -04:00
Mahmood Ali
d99da5b656 upgrade machine image for most jobs
It looks like the host's unattended upgrades are interfering with chroot
creation.  Here, we upgrade the machine image to one without misconfigured
unattended upgrades, across the board except for the `test-docker`
job.

Docker seems to be misbehaving on that image, and we get some unexpected
cgroups errors, e.g. https://circleci.com/gh/hashicorp/nomad/3854 .

Sample recent failures of `test-exec`:

https://circleci.com/gh/hashicorp/nomad/3633
https://circleci.com/gh/hashicorp/nomad/3696
https://circleci.com/gh/hashicorp/nomad/3714
https://circleci.com/gh/hashicorp/nomad/3764
https://circleci.com/gh/hashicorp/nomad/3770
https://circleci.com/gh/hashicorp/nomad/3834
2019-08-28 09:50:56 -04:00
Nick Ethier
f631ec6c2d cli: display group ports and address in alloc status command output (#6189)
* cli: display group ports and address in alloc status command output

* add assertions for port.To = -1 case and convert assertions to testify
2019-08-27 23:59:36 -04:00
Nick Ethier
51750f5732 Add environment variables for connect upstreams (#6171)
* taskenv: add connect upstream env vars + test

* set taskenv upstreams instead of appending

* Update client/taskenv/env.go

Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>
2019-08-27 23:41:38 -04:00
Michael Schurter
6a1bdf04c4 consul: use Consul's defaults and env vars
Use Consul's API package defaults and env vars as Nomad's defaults.
2019-08-27 14:56:52 -07:00
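A minimal sketch of the "environment variable first, built-in default second" pattern this commit describes. `CONSUL_HTTP_ADDR` and the `127.0.0.1:8500` fallback are the ones Consul's API package uses; the `consulAddr` helper and its injectable `getenv` parameter are hypothetical and only keep the example self-contained. The commit itself relies on Consul's API package rather than re-implementing the lookup.

```go
package main

import (
	"fmt"
	"os"
)

// defaultConsulAddr mirrors the address Consul's API client falls back to.
const defaultConsulAddr = "127.0.0.1:8500"

// consulAddr is a hypothetical helper: it returns the address from
// CONSUL_HTTP_ADDR when set, otherwise the built-in default.
// getenv is injected so the lookup can be exercised without touching
// the real process environment.
func consulAddr(getenv func(string) string) string {
	if addr := getenv("CONSUL_HTTP_ADDR"); addr != "" {
		return addr
	}
	return defaultConsulAddr
}

func main() {
	fmt.Println("consul address:", consulAddr(os.Getenv))
}
```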
Mahmood Ali
493945a8a4 Alternative approach: avoid restoring
This uses an alternative approach where we avoid restoring the alloc
runner in the first place, if we suspect that the alloc may have been
completed already.
2019-08-27 17:30:55 -04:00
Lang Martin
23d1214947 match pinned versions for sub-modules 2019-08-27 12:58:12 -04:00
Jasmine Dahilig
80dfa33223 expose nomad namespace as environment variable in allocation #5692 (#6192) 2019-08-27 08:38:07 -07:00
Jasmine Dahilig
d29fa2b48c remove network stanza from job init --short example jobspec (#6179) 2019-08-27 07:36:32 -07:00
Mahmood Ali
cbc521e1e7 alloc_runner: wait when starting suspicious allocs
This commit aims to help users running clients susceptible to the
destroyed-alloc-being-restarted bug upgrade to the latest version.  Without
this, such users will have their tasks run unexpectedly on upgrade and only
see the bug resolved after a subsequent restart.

If, on restore, the client sees a pending alloc without any other
persisted info, then err on the side of treating it as a corrupt persisted
state of an alloc, rather than the client happening to have been killed
right when the alloc was assigned to the client.

A few reasons motivate this behavior:

Statistically speaking, corruption being the cause is more likely.  A
long running client will have higher chance of having allocs persisted
incorrectly with pending state.  Being killed right when an alloc is
about to start is relatively unlikely.

Also, delaying starting an alloc that hasn't started (by hopefully
seconds) is not as severe as launching too many allocs that may bring
client down.

More importantly, this helps customers upgrade their clients without
risking taking their clients down and destabilizing their cluster. We
don't want existing users to be forced into triggering the bug while they
upgrade and restart the cluster.
2019-08-26 22:05:31 -04:00
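The restore heuristic described in the commit above can be sketched as follows. This is a simplified illustration, not Nomad's code: the `persistedAlloc` type, its fields, and `isSuspicious` are hypothetical names, and the real client state holds far more than a status string and a task map.

```go
package main

import "fmt"

// persistedAlloc is a hypothetical, simplified view of what a client
// might find in its local state DB for one allocation.
type persistedAlloc struct {
	ID           string
	ClientStatus string            // e.g. "pending", "running"
	TaskStates   map[string]string // task name -> persisted state; empty if none
}

// isSuspicious reports whether a restored alloc looks like corrupt
// persisted state: it is pending and nothing else was persisted for it.
// Such allocs would be held back instead of being started immediately.
func isSuspicious(a persistedAlloc) bool {
	return a.ClientStatus == "pending" && len(a.TaskStates) == 0
}

func main() {
	allocs := []persistedAlloc{
		{ID: "a1", ClientStatus: "pending"},
		{ID: "a2", ClientStatus: "pending", TaskStates: map[string]string{"web": "running"}},
		{ID: "a3", ClientStatus: "running", TaskStates: map[string]string{"db": "running"}},
	}
	for _, a := range allocs {
		fmt.Printf("alloc %s suspicious=%v\n", a.ID, isSuspicious(a))
	}
}
```

The trade-off matches the reasoning in the commit message: a false positive only delays an alloc, while a false negative may start many unwanted allocs at once.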
Lang Martin
0aa79ca764 govendor fetch github.com/hashicorp/go-getter@f5101da, protobuf 1.2 2019-08-26 17:54:21 -04:00
Mahmood Ali
f61637026e Merge pull request #6207 from hashicorp/b-gc-destroyed-allocs-rerun
Don't persist allocs of destroyed alloc runners
2019-08-26 17:26:18 -04:00
Tim Gross
e2efeb4911 init: add generated assets into bindata 2019-08-26 14:24:15 -04:00
Mahmood Ali
ff3dedd534 Write to client store while holding lock
Protect against a race between the destroy and persist-state
goroutines.

The downside is that the database io operation will run while holding
the lock and may run indefinitely.  The risk of lock being long held is
slow destruction, but slow io has bigger problems.
2019-08-26 13:45:58 -04:00
Danielle
8066f9b8f0 Merge pull request #6181 from hashicorp/dani/scheduler-vol-ro
scheduler: Implicit constraint on readonly hostvol
2019-08-26 17:01:49 +02:00
Mahmood Ali
925eed89c6 Merge pull request #6205 from hashicorp/b-no-golang-29119-workaround
logmon: revert workaround for Windows go1.11 bug
2019-08-26 10:52:51 -04:00
Nick Fagerlund
3d9e44d40f Update middleman-hashicorp container (#6185) 2019-08-26 09:29:08 -05:00
Mahmood Ali
7c1fe3eae5 logmon: log stat error to help debugging 2019-08-26 10:10:20 -04:00
Mahmood Ali
eb5160427f Merge pull request #6204 from hashicorp/c-circleci-tweaks-20190824
ci: use circleci/golang images directly
2019-08-26 10:08:14 -04:00
Mahmood Ali
a80643e46d Don't persist allocs of destroyed alloc runners
This fixes a bug where allocs that have been GCed get re-run after the client
is restarted.  A heavily-used client may launch thousands of allocs on startup
and get killed.

The bug is that an alloc runner that gets destroyed due to GC remains in the
client's alloc runner set.  Periodically, it gets persisted until the alloc is
GCed by the server.  During that time, the client DB will contain the alloc
but neither its individual task statuses nor its completed state.  On client
restart, the client assumes that the alloc is in pending state and re-runs it.

Here, we fix it by ensuring that destroyed alloc runners don't persist any alloc
to the state DB.

This is a short-term fix, as we should consider revamping client state
management.  Storing alloc and task information non-transactionally and
non-atomically, concurrently with the alloc runner running and potentially
changing state, is a recipe for bugs.

Fixes https://github.com/hashicorp/nomad/issues/5984
Related to https://github.com/hashicorp/nomad/pull/5890
2019-08-25 11:21:28 -04:00
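A minimal sketch of the idea behind this fix, assuming a runner that guards a `destroyed` flag with a mutex and checks it before every write. The `allocRunner` type, its fields, and the map standing in for the state DB are all hypothetical; removing the entry on destroy is likewise an assumption made for the illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// allocRunner is a hypothetical, stripped-down runner. db stands in for
// the client state DB and is only accessed while mu is held.
type allocRunner struct {
	mu        sync.Mutex
	destroyed bool
	allocID   string
	db        map[string]string
}

// persist writes the alloc status unless the runner has been destroyed.
// The check and the write happen under the same lock, so a concurrent
// destroy cannot slip in between them.
func (r *allocRunner) persist(status string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.destroyed {
		return false // destroyed runners never write to the state DB
	}
	r.db[r.allocID] = status
	return true
}

// destroy marks the runner as destroyed and removes its persisted entry.
func (r *allocRunner) destroy() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.destroyed = true
	delete(r.db, r.allocID)
}

func main() {
	r := &allocRunner{allocID: "a1", db: map[string]string{}}
	fmt.Println("persisted before destroy:", r.persist("running"))
	r.destroy()
	fmt.Println("persisted after destroy:", r.persist("pending"))
	fmt.Println("entries left in db:", len(r.db))
}
```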
Mahmood Ali
cc3da4a441 logmon: revert workaround for Windows go1.11 bug
Revert e0126123ab now that we are running
with Golang 1.12, and https://github.com/golang/go/issues/29119 is no
longer relevant.
2019-08-24 08:19:44 -04:00
Mahmood Ali
b4a80a7eea Merge pull request #6201 from hashicorp/b-device-stats-interval
initialize device manager stats interval
2019-08-24 08:16:03 -04:00
Mahmood Ali
b6bf83ad72 use circleci/golang images directly
We currently use a container image for the `test-devices` job only, while
all other jobs use the machine executor.

This allows us to switch golang and protoc versions easily without
manually managing Docker images (which requires building them manually
on a dev machine, etc).  All the while, we install dependencies on
every build in all other jobs.

`test-devices` is now one of the fastest jobs and isn't a constraint or
a bottleneck, so increasing its overhead by a few seconds doesn't hurt the
overall developer iteration.

If we split tests effectively later, we can revisit.
2019-08-23 21:59:49 -04:00
Mahmood Ali
f4571cb9a9 use a new image with proper protoc dependency
Fixes `test-devices` job
2019-08-23 21:33:07 -04:00
Mahmood Ali
e87d9cc8a6 Merge pull request #6146 from hashicorp/b-config-template-copy
clientConfig.Copy() to copy template config too
2019-08-23 19:00:57 -04:00
Mahmood Ali
e8ebde4ca2 clientConfig.Copy() to copy template config too 2019-08-23 18:43:22 -04:00
Mahmood Ali
a72a0f8832 Merge pull request #5676 from hashicorp/f-b-upgrade-ugorji-dep-20190508
Update ugorji/go to latest
2019-08-23 18:29:49 -04:00
Lang Martin
4877face87 Merge pull request #6203 from hashicorp/b-chroot-setuid-110
exec driver setuid go-getter update
2019-08-23 16:49:41 -04:00
Lang Martin
5fc06cd65f taskrunner getter set Umask for go-getter, setuid test 2019-08-23 15:59:03 -04:00
Lang Martin
07373be85c govendor fetch github.com/hashicorp/go-getter@6be654f 2019-08-23 15:59:03 -04:00
Mahmood Ali
01983ae59b initialize device manager stats interval
Fixes a bug where the CPU is pegged at 100% due to collecting device
statistics.  The passed stats interval was ignored, and the default zero
value causes a very tight loop of stats collection.

FWIW, in my testing, it took 2.5-3ms to collect nvidia GPU stats, on a
`g2.2xlarge` ec2 instance.

The stats interval defaults to 1 second and is user configurable.  I
believe this is too frequent as a default, and I may advocate for
reducing it to a value closer to 5s or 10s, but keeping it as is for
now.

Fixes https://github.com/hashicorp/nomad/issues/6057 .
2019-08-23 14:58:34 -04:00
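To illustrate why a zero interval is dangerous, here is a hedged sketch of a collection loop that guards against it. The actual fix passes the configured interval through to the device manager; the `normalizeInterval` guard, `collectLoop`, and `defaultStatsInterval` below are hypothetical names used only to show the failure mode and one defensive pattern. The 1 second default is the value stated in the commit message.

```go
package main

import (
	"fmt"
	"time"
)

// defaultStatsInterval matches the 1 second default mentioned in the commit.
const defaultStatsInterval = 1 * time.Second

// normalizeInterval replaces a zero or negative interval with the default,
// so an unset value can never turn the collection loop into a busy loop.
func normalizeInterval(d time.Duration) time.Duration {
	if d <= 0 {
		return defaultStatsInterval
	}
	return d
}

// collectLoop calls collect once per interval until stop is closed.
func collectLoop(interval time.Duration, stop <-chan struct{}, collect func()) {
	ticker := time.NewTicker(normalizeInterval(interval))
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			collect()
		case <-stop:
			return
		}
	}
}

func main() {
	fmt.Println("unset interval becomes:", normalizeInterval(0))
	fmt.Println("configured interval stays:", normalizeInterval(5*time.Second))

	// Run the loop briefly with a short interval to show it in use.
	stop := make(chan struct{})
	done := make(chan struct{})
	go func() {
		collectLoop(10*time.Millisecond, stop, func() { fmt.Println("collected stats") })
		close(done)
	}()
	time.Sleep(35 * time.Millisecond)
	close(stop)
	<-done
}
```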
Mahmood Ali
0c4718378b Merge pull request #6200 from hashicorp/r-golang-1.12.9
Update golang to 1.12.9
2019-08-23 14:37:21 -04:00
Tim Gross
c7c8b01122 agent: -dev=connect mode bind to 0.0.0.0
The dev mode flag for connect was binding to the default interface's
IP, but this makes for a bad user experience for the CLI which will
default to 127.0.0.1. If we bind to 0.0.0.0 instead the CLI will work
without further configuration by the user.
2019-08-23 13:51:16 -04:00
Jerome Gravel-Niquet
25e38c8257 Consul service meta (#6193)
* adds meta object to service in job spec, sends it to consul

* adds tests for service meta

* fix tests

* adds docs

* better hashing for service meta, use helper for copying meta when registering service

* tried to be DRY, but looks like it would be more work to use the
helper function
2019-08-23 12:49:02 -04:00
Mahmood Ali
3e1f584495 update circleci builds to use golang 1.12.9 2019-08-23 12:26:47 -04:00
Mahmood Ali
0ccca0ad59 use golang 1.12 2019-08-23 09:44:40 -04:00
Nick Ethier
974ff0392c ar: fix bridge networking port mapping when port.To is unset (#6190) 2019-08-22 21:53:52 -04:00
Preetha
28b0650dc9 Bring 0.9.5 changes to changelog on master branch 2019-08-22 17:35:15 -05:00
Buck Doyle
10e675c200 Remove most Netlify configuration (#6194)
This removes the in-repository Netlify configuration. There are now two
sites backed by the repository, so we must use the web UI to
control the build settings, as having the configuration in-repository
overrides the web UI settings.

The build settings for the two sites are below, as of this commit. See
the extra step in nomad-ui site’s build step that copies the _redirects
file to the correct destination so things are properly forwarded when
you visit the deployment.

nomad-ui:

base directory: ui
build command: ember build && mkdir -p ui-dist/ui && mv dist/* ui-dist/ui/ && cp ../.netlify/ui-redirects ui-dist/_redirects
publish directory: ui/ui-dist

nomad-website:

base directory: website
build command: bundle exec middleman build
publish directory: website/build
2019-08-22 15:54:23 -05:00
Michael Schurter
72193f99be Merge pull request #6121 from hashicorp/f-connect-bootstrap
connect: task hook for bootstrapping envoy sidecar
2019-08-22 10:58:31 -07:00
Michael Schurter
43d89f864e connect: task hook for bootstrapping envoy sidecar
Fixes #6041

Unlike all other Consul operations, bootstrapping requires that Consul be
available. This PR tries Consul 3 times with a backoff to account for
the group services being asynchronously registered with Consul.
2019-08-22 08:15:32 -07:00
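The "try 3 times with a backoff" behavior can be sketched as below. The commit does not say which backoff schedule is used, so the doubling delay here is an assumption, and `retryWithBackoff` and `errNotReady` are hypothetical names rather than Nomad's.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNotReady is a hypothetical error standing in for "the sidecar service
// is not registered with Consul yet".
var errNotReady = errors.New("service not registered yet")

// retryWithBackoff runs op up to attempts times, doubling the wait between
// tries (base, 2*base, ...). It returns nil on the first success, or the
// last error wrapped with the attempt count.
func retryWithBackoff(attempts int, base time.Duration, op func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(base << uint(i)) // no sleep after the final attempt
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	err := retryWithBackoff(3, 50*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errNotReady // simulate registration still in flight
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```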
Mahmood Ali
5eee9ee59f Merge pull request #6187 from hashicorp/c-circleci-tweak-20190822
ci: Use more recent base machine executor image for test-rkt
2019-08-22 11:10:05 -04:00
Mahmood Ali
91bccfc83b ci: Use more recent base machine executor image
This fixes a frequent failure in `test-rkt` jobs where dpkg installation
fails.

The image currently in use, circleci/classic:201808-01, has unattended
upgrades enabled accidentally, and they run on every build.  This means
that tools get modified unexpectedly during builds, and apt-get commands
may fail because the unattended upgrade is holding the package database lock.

This updates `test-rkt` job only because the new image breaks
`test-docker` job (e.g. https://circleci.com/gh/hashicorp/nomad/2641 ),
and I punted on investigating test-docker for another day.
2019-08-22 10:31:57 -04:00
Buck Doyle
2364fb2da1 UI: Add creation time to evaluations table (#6050) 2019-08-22 08:11:24 -05:00
Danielle
fbddb9281b Merge pull request #6175 from hashicorp/dani/remove-hidden-vols
remove hidden field from host volumes
2019-08-22 08:49:54 +02:00