Commit Graph

49 Commits

Author SHA1 Message Date
tehut
21841d3067 Add historical journald and log export flags to operator debug command (#26410)
* Add -log-file-export and -log-lookback flags to add historical logs to the
debug capture
* use monitor.PrepFile() helper for other historical log tests
2025-08-04 13:55:25 -07:00
James Rasell
2ef837f02f cli: Ensure all no argument console messages are the same. (#26331)
Use a constant to ensure consistency across the CLI when displaying
a console message indicating the command was passed arguments when
it takes none.
2025-07-25 07:05:10 +01:00
Daniel Bennett
c46521a80d cli: operator debug: respect NOMAD_REGION env var (#25716)
properly filter out regions other than the one specified
like the -namespace flag does
2025-04-21 17:06:50 -04:00
Sujata Roy
6f34bf3ba7 Nomad Default to 5m duration and trace-level logging 2024-07-09 16:43:02 -07:00
Tim Gross
a50e6267d0 cli: remove redundant allocs profile from operator debug (#20219)
The pprof `allocs` profile is identical to the `heap` profile, just with a
different default view. Collecting only one of the two is sufficient to view all
of `alloc_objects`, `alloc_space`, `inuse_objects`, and `inuse_space`, and
collecting only one means that both views will be of the same profile.

Also improve the docstrings on the goroutine profiles explaining what's in each
so that it's clear why we might want all of debug=0, debug=1, and debug=2.
2024-03-26 08:19:18 -04:00
Tim Gross
02d98b9357 operator debug: fix pprof interval handling (#20206)
The `nomad operator debug` command saves a CPU profile for each interval, and
names these files based on the interval.

The same function takes a goroutine profile, heap profile, etc. but is missing
the logic to interpolate the file name with the interval. This results in the
operator debug command making potentially many expensive profile requests, and
then overwriting the data. Update the command to save every profile it scrapes,
and number them similarly to the existing CPU profile.

Additionally, the command flags for `-pprof-interval` and `-pprof-duration` were
validated backwards, which meant that we always coerced the `-pprof-interval` to
be the same as the `-pprof-duration`, which always resulted in a single profile
being taken at the start of the bundle. Correct the check as well as change the
defaults to be more sensible.

Fixes: https://github.com/hashicorp/nomad/issues/20151
2024-03-25 09:01:06 -04:00
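Editorial sketch for 02d98b9357 above: one way to keep later captures from overwriting earlier ones is to interpolate the interval number into every profile's file name, not only the CPU profile's. The helper name `profileFileName` and the zero-padded `%04d` format are assumptions made for illustration; they are not taken from the Nomad source.

```go
package main

import "fmt"

// profileFileName is a hypothetical helper: it builds a distinct file name
// for each profile kind and capture interval, so a later capture never
// overwrites an earlier one.
func profileFileName(kind string, interval int) string {
	return fmt.Sprintf("%s_%04d.prof", kind, interval)
}

func main() {
	// Three intervals, three profile kinds: nine distinct file names.
	for interval := 0; interval < 3; interval++ {
		for _, kind := range []string{"profile", "goroutine", "heap"} {
			fmt.Println(profileFileName(kind, interval))
		}
	}
}
```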
Kerim Satirli
5e1bbf90fc docs: update all URLs to developer.hashicorp.com (#16247) 2023-10-24 11:00:11 -04:00
James Rasell
ca9e08e6b5 monitor: add log include location option on monitor CLI and API (#18795) 2023-10-20 07:55:22 +01:00
Seth Hoenig
f5b0da1d55 all: swap exp packages for maps, slices (#18311) 2023-08-23 15:42:13 -05:00
hashicorp-copywrite[bot]
a9d61ea3fd Update copyright file headers to BUSL-1.1 2023-08-10 17:27:29 -05:00
Ville Vesilehto
2c463bb038 chore(lint): use Go stdlib variables for HTTP methods and status codes (#17968) 2023-07-26 15:28:09 +01:00
hashicorp-copywrite[bot]
f005448366 [COMPLIANCE] Add Copyright and License Headers 2023-04-10 15:36:59 +00:00
Lance Haig
99f43c1144 Update ioutil library references to os and io respectively for command (#16329)
No user-facing changes, so I assume no changelog entry is required
2023-03-08 09:20:04 -06:00
Luiz Aoqui
9d28d9eb47 cli: prevent panic on operator debug (#14992)
If the API returns an error during debug bundle collection the CLI was
expanding the wrong error object, resulting in a panic since `err` is
`nil`.
2022-10-20 15:53:58 -04:00
Tim Gross
349501f825 operator debug: write NDJSON for large collections (#14610)
The `operator debug` command writes JSON files from API responses as a single
line containing an array of JSON objects. But some of these files can be
extremely large (GBs) for large production clusters, which makes it difficult
to parse them using typical line-oriented Unix command line tools that can
stream their inputs without consuming a lot of memory.

For collections that are typically large, instead emit newline-delimited JSON.

This changeset includes some first-pass refactoring of this command. It breaks
up monolithic methods that validate a path, create a file, serialize objects,
and write them to disk into smaller functions, some of which can now be
standalone to take advantage of generics.
2022-09-22 10:02:00 -04:00
Seth Hoenig
ff1a30fe8d cleanup more helper updates (#14638)
* cleanup: refactor MapStringStringSliceValueSet to be cleaner

* cleanup: replace SliceStringToSet with actual set

* cleanup: replace SliceStringSubset with real set

* cleanup: replace SliceStringContains with slices.Contains

* cleanup: remove unused function SliceStringHasPrefix

* cleanup: fixup StringHasPrefixInSlice doc string

* cleanup: refactor SliceSetDisjoint to use real set

* cleanup: replace CompareSliceSetString with SliceSetEq

* cleanup: replace CompareMapStringString with maps.Equal

* cleanup: replace CopyMapStringString with CopyMap

* cleanup: replace CopyMapStringInterface with CopyMap

* cleanup: fixup more CopyMapStringString and CopyMapStringInt

* cleanup: replace CopySliceString with slices.Clone

* cleanup: remove unused CopySliceInt

* cleanup: refactor CopyMapStringSliceString to be generic as CopyMapOfSlice

* cleanup: replace CopyMap with maps.Clone

* cleanup: run go mod tidy
2022-09-21 14:53:25 -05:00
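Editorial sketch for ff1a30fe8d above: several of the removed string-specific helpers map directly onto generic functions. The example uses the Go standard library `slices` and `maps` packages (Go 1.21+) so that it is self-contained. At the time of this commit the equivalents likely came from the `exp` packages, which f5b0da1d55 later swapped out. The variable names are illustrative only.

```go
package main

import (
	"fmt"
	"maps"
	"slices"
)

func main() {
	tags := []string{"client", "server"}

	// slices.Contains covers what SliceStringContains did.
	fmt.Println(slices.Contains(tags, "server"))

	// slices.Clone covers CopySliceString: the copy has its own backing array.
	tagsCopy := slices.Clone(tags)
	tagsCopy[0] = "changed"
	fmt.Println(tags[0])

	// maps.Clone covers CopyMapStringString; maps.Equal covers
	// CompareMapStringString.
	meta := map[string]string{"region": "global"}
	metaCopy := maps.Clone(meta)
	fmt.Println(maps.Equal(meta, metaCopy))

	metaCopy["region"] = "eu"
	fmt.Println(maps.Equal(meta, metaCopy))
}
```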
Seth Hoenig
1b1a68e42f cleanup: move fs helpers into escapingfs 2022-08-24 14:45:34 -05:00
James Rasell
581390bed1 cli: do not import structs, use API package only. (#13938) 2022-08-02 16:33:08 +02:00
Tim Gross
ad4efceb91 query for leader in operator debug command (#13472)
The `operator debug` command doesn't output the leader anywhere in the
output, which adds extra burden to offline debugging (away from an
ongoing incident where you can simply check manually). Query the
`/v1/status/leader` API but degrade gracefully.
2022-07-06 10:57:44 -04:00
Dave May
522b630825 debug: add version constraint to avoid pprof panic (#12807) 2022-04-28 13:18:55 -04:00
Tim Gross
ab6f13db1d Fix flaky operator debug test (#12501)
We introduced a `pprof-interval` argument to `operator debug` in #11938, and unfortunately this has resulted in a lot of test flakes. The actual command in use is mostly fine (although I've fixed some quirks here), so what's really happened is that the change has revealed some existing issues in the tests. Summary of changes:

* Make first pprof collection synchronous to preserve the existing
  behavior for the common case where the pprof interval matches the
  duration.

* Clamp `operator debug` pprof timing to that of the command. The
  `pprof-duration` should be no more than `duration` and the
  `pprof-interval` should be no more than `pprof-duration`. Clamp the
  values rather than throwing errors, which could change the commands
  that existing users might already have in debugging scripts.

* Testing: remove test parallelism

  The `operator debug` tests that stand up servers can't be run in
  parallel, because we don't have a way of canceling the API calls for
  pprof. The agent will still be running the last pprof when we exit,
  and that breaks the next test that talks to that same agent.
  (Because you can only run one pprof at a time on any process!)

  We could split off each subtest into its own server, but this test
  suite is already very slow. In future work we should fix this "for
  real" by making the API call cancelable.


* Testing: assert against unexpected errors in `operator debug` tests.

  If we assert there are no unexpected error outputs, it's easier for
  the developer to debug when something is going wrong with the tests
  because the error output will be presented as a failing test, rather
  than just a failing exit code check. Or worse, no failing exit code
  check!

  This also forces us to be explicit about which tests will return 0
  exit codes but still emit (presumably ignorable) error outputs.

Additional minor bug fixes (mostly in tests) and test refactorings:

* Fix text alignment on pprof Duration in `operator debug` output

* Remove "done" channel from `operator debug` event stream test. The
  goroutine we're blocking for here already tells us it's done by
  sending a value, so block on that instead of an extraneous channel

* Event stream test timer should start at current time, not zero

* Remove noise from `operator debug` test log output. The `t.Logf`
  calls already are picked out from the rest of the test output by
  being prefixed with the filename.

* Remove explicit pprof args so we use the defaults clamped from
  duration/interval
2022-04-07 15:00:07 -04:00
Danish Prakash
ff6ae5fad2 command/operator_debug: add pprof interval (#11938) 2022-04-04 15:24:12 -04:00
Dave May
8d28bfe415 cli: Add event stream capture to nomad operator debug (#11865) 2022-01-17 21:35:51 -05:00
Michael Schurter
dc81f2650a cli: improve debug error messages (#11507)
Improves `nomad debug` error messages when contacting agents that do not
have /v1/agent/host endpoints (the endpoint was added in v0.12.0)

Part of #9568 and manually tested against Nomad v0.8.7.

Hopefully isRedirectError can be reused for more cases listed in #9568
2022-01-17 11:15:17 -05:00
Tim Gross
072d3b6b74 cli: ensure -stale flag is respected by nomad operator debug (#11678)
When a cluster doesn't have a leader, the `nomad operator debug`
command can safely use stale queries to gracefully degrade the
consistency of almost all its queries. The query parameter for these
API calls was not being set by the command.

Some `api` package queries do not include `QueryOptions` because
they target a specific agent, but they can potentially be forwarded to
other agents. If there is no leader, these forwarded queries will
fail. Provide methods to call these APIs with `QueryOptions`.
2021-12-15 10:44:03 -05:00
Dave May
6ede4b9285 cli: refactor operator debug capture (#11466)
* debug: refactor Consul API collection
* debug: refactor Vault API collection
* debug: cleanup test timing
* debug: extend test to multiregion
* debug: save cmdline flags in bundle
* debug: add cli version to output
* Add changelog entry
2021-11-05 19:43:10 -04:00
Dave May
f46b97b2df debug: update default node-id and docs (#11398)
* debug: default node-id to all
* debug: align cli help and website documentation
2021-10-27 13:43:56 -04:00
Dave May
1d30caafad cli: rename paths in debug bundle for clarity (#11307)
* Rename folders to reflect purpose
* Improve captured files test coverage
* Rename CSI plugins output file
* Add changelog entry
* fix test and make changelog message more explicit

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2021-10-13 18:00:55 -04:00
Dave May
6852f21ddd cli: Improved autocomplete support for job dispatch and operator debug (#11270)
* Add autocomplete to nomad job dispatch
* Add autocomplete to nomad operator debug
* Update incorrect comment
* Update test to verify autocomplete
* Add changelog
* Apply lint suggestions
* Create dynamic slices instead of specific length
* Align style across predictors
2021-10-12 20:01:54 -04:00
Dave May
1bd132f09d debug: Improve namespace and region support (#11269)
* Include region and namespace in CLI output
* Add region and prefix matching for server members
* Add namespace and region API outputs to cluster metadata folder
* Add region awareness to WaitForClient helper function
* Add helper functions for SliceStringHasPrefix and StringHasPrefixInSlice
* Refactor test client agent generation
* Add tests for region
* Add changelog
2021-10-12 16:58:41 -04:00
James Rasell
3bffe443ac chore: fix incorrect docstring formatting. 2021-08-30 11:08:12 +02:00
Dave May
b430bafe90 Add remaining pprof profiles to nomad operator debug (#10748)
* Add remaining pprof profiles to debug dump
* Refactor pprof profile capture
* Add WaitForFilesUntil and WaitForResultUntil utility functions
* Add CHANGELOG entry
2021-06-21 14:22:49 -04:00
Yoan Blanc
a814f0253f chore: bump golangci-lint from v1.24 to v1.39
Signed-off-by: Yoan Blanc <yoan@dosimple.ch>
2021-04-03 09:50:23 +02:00
Dave May
5e6cb151c5 debug: Remove extra linefeed in monitor.log (#10252) 2021-03-29 09:22:27 -04:00
Dave May
83af6f5785 debug: update defaults to commonly used values 2021-03-09 08:31:38 -05:00
Dave May
d1648243f4 Handle Consul API URL protocol mismatch (#10082) 2021-02-25 08:22:44 -05:00
Dave May
8038641f1b debug: Fix node count bug from GH-9566 (#9625)
* debug: update test to identify bug in GH-9566
* debug: range tests need fresh cmd each iteration
* debug: fix node count bug in GH-9566
2020-12-14 15:02:48 -05:00
Kris Hicks
85ed8ddd4f Add gosimple linter (#9590) 2020-12-09 11:05:18 -08:00
Kris Hicks
071f4c7596 Add gocritic to golangci-lint config (#9556) 2020-12-08 12:47:04 -08:00
Dave May
d8070e99b1 nomad operator debug - add pprof duration / csi details (#9346)
* debug: add pprof duration CLI argument
* debug: add CSI plugin details
* update help text with ACL requirements
* debug: provide ACL hints upon permission failures
* debug: only write file when pprof retrieve is successful
* debug: add helper function to clean bad characters from dynamic filenames
* debug: ensure files are unable to escape the capture directory
2020-12-01 12:36:05 -05:00
Tim Gross
8a66f11bb3 docs: describe required ACLs for all commands 2020-11-20 13:38:29 -05:00
Tim Gross
89f4f51746 command: remove -namespace from help options when not applicable 2020-11-19 16:28:39 -05:00
Dave May
205b0e7cae nomad operator debug - add client node filtering arguments (#9331)
* operator debug - add client node filtering arguments

* add WaitForClient helper function

* use RPC in WaitForClient to avoid unnecessary imports

* guard against nil values

* move initialization up and shorten test duration

* cleanup nodeLookupFailCount logic

* only display max node notice if we actually tried to capture nodes
2020-11-12 11:25:28 -05:00
Dave May
71a022ad8c Metrics gotemplate support, debug bundle features (#9067)
* add goroutine text profiles to nomad operator debug

* add server-id=all to nomad operator debug

* fix bug from changing metrics from string to []byte

* Add function to return MetricsSummary struct, metrics gotemplate support

* fix bug resolving 'server-id=all' when no servers are available

* add url to operator_debug tests

* removed test section which is used for future operator_debug.go changes

* separate metrics from operator, use only structs from go-metrics

* ensure parent directories are created as needed

* add suggested comments for text debug pprof

* move check down to where it is used

* add WaitForFiles helper function to wait for multiple files to exist

* compact metrics check

Co-authored-by: Drew Bailey <2614075+drewbailey@users.noreply.github.com>

* fix github's silly apply suggestion

Co-authored-by: Drew Bailey <2614075+drewbailey@users.noreply.github.com>
2020-10-14 15:16:10 -04:00
davemay99
bf8bdc94f8 Add metrics command / output to debug bundle 2020-10-05 22:30:01 -04:00
Drew Bailey
0a94c62ca4 run commands for duration and interval without needing to specify servers or nodes 2020-08-31 14:13:03 -04:00
Drew Bailey
41fa0daae2 add license info to operator debug command 2020-08-31 13:22:23 -04:00
Lang Martin
ec3b2bf4a5 command/operator_debug: mkdir before storing agent-host (#8707)
The api calls were reordered, the new order omits the
`agent-host.json` result by fetching it before the directory is
created.
2020-08-28 11:58:06 -04:00
Lang Martin
b5ef217c90 nomad debug renamed to nomad operator debug (#8602)
* renamed: command/debug.go -> command/operator_debug.go
* website: rename debug -> operator debug
* website/pages/api-docs/agent: name in api docs
2020-08-11 15:39:44 -04:00