Commit Graph

3962 Commits

Author SHA1 Message Date
Drew Bailey
57dc0c6a46 test pprof headers and profile methods
tidy up, add comments

clean up seconds param assignment
2020-01-09 15:15:04 -05:00
Drew Bailey
c28e5ad036 warn when enabled debug is on when registering
m -> a receiver name

return codederrors, fix query
2020-01-09 15:15:04 -05:00
Drew Bailey
fb1b4cdc26 Server request forwarding for Agent.Profile
Return rpc errors for profile requests, set up remote forwarding to
target leader or server id for profile requests.

server forwarding, endpoint tests
2020-01-09 15:15:03 -05:00
Drew Bailey
240c0ee0ec agent pprof endpoints
wip, agent endpoint and client endpoint for pprof profiles

agent endpoint test
2020-01-09 15:15:02 -05:00
Tim Gross
f31482ae8a interpolate environment for services in script checks (#6916)
In 0.10.2 (specifically 387b016) we added interpolation to group
service blocks and centralized the logic for task environment
interpolation. This wasn't also added to script checks, which caused a
regression where the IDs for script checks for services w/
interpolated fields (ex. the service name) didn't match the service ID
that was registered with Consul.

This changeset calls the same taskenv interpolation logic during
`script_check` configuration, and adds tests to reduce the risk of
future regressions by comparing the IDs of service hook and the check hook.
2020-01-09 08:12:54 -05:00
Tim Gross
ac7990f2a7 client: fix trace log message in alloc hook update (#6881) 2019-12-19 16:44:04 -05:00
Drew Bailey
8c1f9b4128 docs for shutdown delay
update docs, address pr comments

ensure pointer is not nil

use pointer for diff tests, set vs unset
2019-12-16 11:38:35 -05:00
Drew Bailey
3b033b2ef5 allow only positive shutdown delay
more explicit test case, remove select statement
2019-12-16 11:38:30 -05:00
Drew Bailey
672b76056b shutdown delay for task groups
copy struct values

ensure groupserviceHook implements RunnerPreKillhook

run deregister first

test that shutdown times are delayed

move magic number into variable
2019-12-16 11:38:16 -05:00
Danielle
59eb882197 Update client/fingerprint/env_aws.go
Co-Authored-By: Mahmood Ali <mahmood@hashicorp.com>
2019-12-16 14:48:52 +01:00
Danielle Lancashire
544807ba79 env_aws: Disable Retries and set Session cfg
Previously, Nomad used hand rolled HTTP requests to interact with the
EC2 metadata API. Recently however, we switched to using the AWS SDK for
this fingerprinting.

The default behaviour of the AWS SDK is to perform retries with
exponential backoff when a request fails. This is problematic for Nomad,
because interacting with the EC2 API is in our client start path.

Here we revert to our pre-existing behaviour of not performing retries
in the fast path, as if the metadata service is unavailable, it's likely
that nomad is not running in AWS.
2019-12-16 10:56:32 +01:00
Mahmood Ali
20f8227c0a Merge pull request #6820 from hashicorp/f-skip-docker-logging-knob
driver: allow disabling log collection
2019-12-13 11:41:20 -05:00
Mahmood Ali
0bdf7a9e23 Merge pull request #6556 from hashicorp/c-vendor-multierror-20191025
Update go-multierror library
2019-12-13 11:32:42 -05:00
Mahmood Ali
e82dad732b address review comments 2019-12-13 11:21:00 -05:00
Mahmood Ali
013097b241 tests: fix error format assertion
multierror library changed formatting slightly.
2019-12-13 11:01:20 -05:00
Chris Dickson
bbb6b2af09 client: expose allocated CPU per task (#6784) 2019-12-09 15:40:22 -05:00
Seth Hoenig
94c60b4cfa tests: swap lib/freeport for tweaked helper/freeport
Copy the updated version of freeport (sdk/freeport), and tweak it for use
in Nomad tests. This means staying below port 10000 to avoid conflicts with
the lib/freeport that is still transitively used by the old version of
consul that we vendor. Also provide implementations to find ephemeral ports
of macOS and Windows environments.

Ports acquired through freeport are supposed to be returned to freeport,
which this change now also introduces. Many tests are modified to include
calls to a cleanup function for Server objects.

This should help quite a bit with some flakey tests, but not all of them.
Our port problems will not go away completely until we upgrade our vendor
version of consul. With Go modules, we'll probably do a 'replace' to swap
out other copies of freeport with the one now in 'nomad/helper/freeport'.
2019-12-09 08:37:32 -06:00
Mahmood Ali
943854469d driver: allow disabling log collection
Operators commonly have docker logs aggregated using various tools and
don't need nomad to manage their docker logs.  Worse, Nomad uses a
somewhat heavy docker api call to collect them and it seems to cause
problems when a client runs hundreds of log collections.

Here we add a knob to disable log aggregation completely for nomad.
When log collection is disabled, we avoid running logmon and
docker_logger for the docker tasks in this implementation.

The downside here is once disabled, `nomad logs ...` commands and API
no longer return logs and operators must corrolate alloc-ids with their
aggregated log info.

This is meant as a stop gap measure.  Ideally, we'd follow up with at
least two changes:

First, we should optimize behavior when we can such that operators don't
need to disable docker log collection.  Potentially by reverting to
using pre-0.9 syslog aggregation in linux environments, though with
different trade-offs.

Second, when/if logs are disabled, nomad logs endpoints should lookup
docker logs api on demand.  This ensures that the cost of log collection
is paid sparingly.
2019-12-08 14:15:03 -05:00
Mahmood Ali
16aef03331 Merge pull request #6788 from hashicorp/b-timeout-logmon-stop
logmon: add timeout to RPC operations
2019-12-06 19:12:06 -05:00
Danielle Lancashire
5efa8994b9 spellcheck: Fix spelling of retrieve 2019-12-05 18:59:47 -06:00
Mahmood Ali
10c97bfb29 Merge pull request #6779 from hashicorp/r-aws-fingerprint-via-library
Use AWS SDK to access EC2 Metadata
2019-12-02 13:30:51 -05:00
Mahmood Ali
3428223004 logmon: add timeout to RPC operations
Add an RPC timeout for logmon.  In
https://github.com/hashicorp/nomad/issues/6461#issuecomment-559747758 ,
`logmonClient.Stop` locked up and indefinitely blocked the task runner
destroy operation.

This is an incremental improvement.  We still need to follow up to
understand how we got to that state, and the full impact of locked-up
Stop and its link to pending allocations on restart.
2019-12-02 10:33:05 -05:00
Mahmood Ali
b55bc6443e fingerprint code refactor
Some code cleanup:

* Use a field for setting EC2 metadata instead of env-vars in testing;
but keep environment variables for backward compatibility reasons

* Update tests to use testify
2019-11-26 10:51:28 -05:00
Mahmood Ali
f24dd5bec9 fingerprint: avoid api query if config overrides it 2019-11-26 10:51:28 -05:00
Mahmood Ali
34b9ee5433 fingerprint: use ec2metadata package 2019-11-26 10:51:27 -05:00
Lars Lehtonen
1c9a66950c client: fix use of T.Fatal inside TestFS_logsImpl_NoFollow() goroutine. 2019-11-25 23:51:28 -08:00
Mahmood Ali
c661d37ca2 fixup! tests: don't assume eth0 network is available 2019-11-21 08:28:20 -05:00
Mahmood Ali
bb81fce18e tests: don't assume eth0 network is available
TestClient_UpdateNodeFromFingerprintKeepsConfig checks a test node
network interface, which is hardcoded to `eth0` and is updated
asynchronously.  This causes flakiness when eth0 isn't available.

Here, we hardcode the value to an arbitrary network interface.
2019-11-20 20:37:30 -05:00
Mahmood Ali
b886e15487 tests: run TestClient_WatchAllocs in non-linux environments 2019-11-20 20:37:29 -05:00
Mahmood Ali
77a0064fcd testS: fix TestClient_RestoreError
When spinning a second client, ensure that it uses new driver
instances, rather than reuse the already shutdown unhealthy drivers from
first instance.

This speeds up tests significantly, but cutting ~50 seconds or so, the
timeout in NewClient until drivers fingerprints.  They never do because
drivers were shutdown already.
2019-11-20 20:37:28 -05:00
Mahmood Ali
47bc949a8f tests: remove TestClient_RestoreError test
TestClient_RestoreError is very slow, taking ~81 seconds.

It has few problematic patterns.  It's unclear what it tests, it
simulates a failure condition where all state db lookup fails and
asserts that alloc fails.  Though starting from
https://github.com/hashicorp/nomad/pull/6216 , we don't fail allocs in
that condition but rather restart them.

Also, the drivers used in second client `c2` are the same singleton
instances used in `c1` and already shutdown.  We ought to start healthy
new driver instances.
2019-11-20 20:37:27 -05:00
Preetha
d4f801d188 Merge pull request #6349 from hashicorp/b-host-stats
client: Return empty values when host stats fail
2019-11-20 10:13:02 -06:00
Lang Martin
c47d52e865 getter: allow the gcs download scheme (#6692) 2019-11-19 09:10:56 -05:00
Nick Ethier
387b016ac4 client: improve group service stanza interpolation and check_re… (#6586)
* client: improve group service stanza interpolation and check_restart support

Interpolation can now be done on group service stanzas. Note that some task runtime specific information
that was previously available when the service was registered poststart of a task is no longer available.

The check_restart stanza for checks defined on group services will now properly restart the allocation upon
check failures if configured.
2019-11-18 13:04:01 -05:00
Drew Bailey
fb49f3c35b add server-id to monitor specific server 2019-11-14 09:53:41 -05:00
Drew Bailey
79411c5e0e coordinate closing of doneCh, use interface to simplify callers
comments
2019-11-05 11:44:26 -05:00
Drew Bailey
33ba36acbd log-json -> json
fix typo command/agent/monitor/monitor.go

Co-Authored-By: Chris Baker <1675087+cgbaker@users.noreply.github.com>

Update command/agent/monitor/monitor.go

Co-Authored-By: Chris Baker <1675087+cgbaker@users.noreply.github.com>

address feedback, lock to prevent send on closed channel

fix lock/unlock for dropped messages
2019-11-05 09:51:59 -05:00
Drew Bailey
bb2a7f4338 address feedback, fix gauge metric name 2019-11-05 09:51:57 -05:00
Drew Bailey
dc977dc8e6 move forwarded monitor request into helper 2019-11-05 09:51:56 -05:00
Drew Bailey
f74bd99b2a monitor command takes no args
rm extra new line

fix lint errors

return after close

fix, simplify test
2019-11-05 09:51:55 -05:00
Drew Bailey
f8eaf1f5af lock in sub select
rm redundant lock

wip to use framing

wip switch to stream frames
2019-11-05 09:51:54 -05:00
Drew Bailey
6bf8617d02 rename function, initialize log level better
underscores instead of dashes for query params
2019-11-05 09:51:53 -05:00
Drew Bailey
1176fc0227 address feedback, use agent_endpoint instead of monitor 2019-11-05 09:51:53 -05:00
Drew Bailey
92d6a30f86 agent:read acl policy for monitor 2019-11-05 09:51:52 -05:00
Drew Bailey
cd60628b31 rpc acl tests for both monitor endpoints 2019-11-05 09:51:51 -05:00
Drew Bailey
e7589301ea enable json formatting, use queryoptions 2019-11-05 09:51:49 -05:00
Drew Bailey
8095b4868a New monitor pkg for shared monitor functionality
Adds new package that can be used by client and server RPC endpoints to
facilitate monitoring based off of a logger

clean up old code

small comment about write

rm old comment about minsize

rename to Monitor

Removes connection logic from monitor command

Keep connection logic in endpoints, use a channel to send results from
monitoring

use new multisink logger and interfaces

small test for dropped messages

update go-hclogger and update sink/intercept logger interfaces
2019-11-05 09:51:49 -05:00
Drew Bailey
890b8a43fb get local rpc endpoint working 2019-11-05 09:51:48 -05:00
Drew Bailey
12819975ee remove log_writer
prefix output with proper spacing

update gzip handler, adjust first byte flow to allow gzip handler bypass

wip, first stab at wiring up rpc endpoint
2019-11-05 09:51:48 -05:00
Michael Schurter
0fcb0d4016 client: fix panic from 0.8 -> 0.10 upgrade
makeAllocTaskServices did not do a nil check on AllocatedResources
which causes a panic when upgrading directly from 0.8 to 0.10. While
skipping 0.9 is not supported we intend to fix serious crashers caused
by such upgrades to prevent cluster outages.

I did a quick audit of the client package and everywhere else that
accesses AllocatedResources appears to be properly guarded by a nil
check.
2019-11-01 07:47:03 -07:00