4268 Commits

Author SHA1 Message Date
Luiz Aoqui
329807bd7f docs: add cpu-allocated and memory-allocated (#15299)
Document the Autoscaler Nomad APM parameters `cpu-allocated` and
`memory-allocated` that were implemented in
https://github.com/hashicorp/nomad-autoscaler/pull/324 and
https://github.com/hashicorp/nomad-autoscaler/pull/334
2022-11-18 10:55:17 -05:00
Tim Gross
21c2d1593a remove deprecated AllocUpdateRequestType raft entry (#15285)
After Deployments were added in Nomad 0.6.0, the `AllocUpdateRequestType` raft
log entry was no longer in use. Mark this as deprecated, remove the associated
dead code, and remove references to the metrics it emits from the docs. We'll
leave the entry itself just in case we encounter old raft logs that we need to
be able to safely load.
2022-11-17 12:08:04 -05:00
Ayrat Badykov
322c6b3dce fix create snapshot request docs (#15242) 2022-11-17 08:43:40 +01:00
Nikita Beletskii
b55ab6318e Fix variable create API example in docs (#15248) 2022-11-15 16:04:11 +01:00
Tim Gross
65b3d01aab eval delete: move batching of deletes into RPC handler and state (#15117)
During unusual outage recovery scenarios on large clusters, a backlog of
millions of evaluations can appear. In these cases, the `eval delete` command can
put excessive load on the cluster by listing large sets of evals to extract the
IDs and then sending large batches of IDs. Although the command's batch size
was carefully tuned, we still need to JSON deserialize, re-serialize to
MessagePack, send the log entries through raft, and get the FSM applied.

To improve performance of this recovery case, move the batching process into the
RPC handler and the state store. The design here is a little weird, so let's
look at the failed options first:

* A naive solution here would be to just send the filter as the raft request and
  let the FSM apply delete the whole set in a single operation. Benchmarking with
  1M evals on a 3 node cluster demonstrated this can block the FSM apply for
  several minutes, which puts the cluster at risk if there's a leadership
  failover (the barrier write can't be made while this apply is in-flight).

* A less naive but still bad solution would be to have the RPC handler filter
  and paginate, and then hand a list of IDs to the existing raft log
  entry. Benchmarks showed this blocked the FSM apply for 20-30s at a time and
  took roughly an hour to complete.

Instead, we're filtering and paginating in the RPC handler to find a page token,
and then passing both the filter and page token in the raft log. The FSM apply
recreates the paginator using the filter and page token to get roughly the same
page of evaluations, which it then deletes. The pagination process is fairly
cheap (only about 5% of the total FSM apply time), so counter-intuitively this
rework ends up being much faster. A benchmark of 1M evaluations showed this
blocked the FSM apply for 20-30ms at a time (typical for normal operations) and
completes in less than 4 minutes.

Note that, as with the existing design, this delete is not consistent: a new
evaluation inserted "behind" the cursor of the pagination will fail to be
deleted.
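
The page-at-a-time delete described above can be sketched roughly as follows. This is a minimal illustration rather than Nomad's implementation: the `store` type, the `deletePage` and `deleteAll` helpers, and the status-based filter are all hypothetical, and the sketch folds the RPC handler's token lookup and the FSM apply into one in-memory loop.

```go
package main

import (
	"fmt"
	"sort"
)

// store is a hypothetical in-memory stand-in for the state store,
// mapping evaluation IDs to their status.
type store struct {
	evals map[string]string
}

// deletePage deletes up to perPage evaluations whose status matches the
// filter, scanning in ID order from pageToken (inclusive). It returns the
// token for the next page, or "" when there is nothing left to delete.
func (s *store) deletePage(status, pageToken string, perPage int) string {
	ids := make([]string, 0, len(s.evals))
	for id := range s.evals {
		ids = append(ids, id)
	}
	sort.Strings(ids)

	deleted := 0
	for _, id := range ids {
		if id < pageToken || s.evals[id] != status {
			continue
		}
		if deleted == perPage {
			return id // first matching ID of the next page
		}
		delete(s.evals, id)
		deleted++
	}
	return ""
}

// deleteAll stands in for the caller's loop: one small apply per page
// instead of a single apply that deletes the whole set.
func deleteAll(s *store, status string, perPage int) (applies int) {
	token := ""
	for {
		token = s.deletePage(status, token, perPage)
		applies++
		if token == "" {
			return applies
		}
	}
}

func main() {
	s := &store{evals: map[string]string{}}
	for i := 0; i < 10; i++ {
		s.evals[fmt.Sprintf("eval-%02d", i)] = "complete"
	}
	s.evals["eval-99"] = "pending"
	fmt.Println("applies:", deleteAll(s, "complete", 4), "remaining:", len(s.evals))
}
```

Each call to `deletePage` does a bounded amount of work, which is the property that keeps any single apply short. As in the design above, an evaluation inserted behind the current token is not picked up by later pages.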
2022-11-14 14:08:13 -05:00
Douglas Jose
1217a96edf Fix wrong reference to vault (#15228) 2022-11-14 10:49:09 +01:00
Kyle Root
263ed6f9c6 Fix broken URL to nvidia device plugin (#15234) 2022-11-14 10:37:06 +01:00
Tim Gross
11a5f79084 exec: allow running commands from host volume (#14851)
The exec driver and other drivers derived from the shared executor check the
path of the command before handing off to libcontainer to ensure that the
command doesn't escape the sandbox. But we don't check any host volume mounts,
which should be safe to use as a source for executables if we're letting the
user mount them to the container in the first place.

Check the mount config to verify the executable lives in the mount's host path,
but then return an absolute path within the mount's task path so that we can hand
that off to libcontainer to run.

Includes a good bit of refactoring here because the anchoring of the final task
path has different code paths for inside the task dir vs inside a mount. But
I've fleshed out the test coverage of this a good bit to ensure we haven't
created any regressions in the process.
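
As a rough sketch of the check described above, the following maps a command's host path into the task's view of a mount. The `mount` type and `resolveInMount` helper are hypothetical names for illustration and assume Unix-style paths; the real executor code is more involved.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// mount is a hypothetical, simplified mount config: a host directory and
// the path where it appears inside the task.
type mount struct {
	HostPath string
	TaskPath string
}

// resolveInMount returns the in-task path for hostCmd if it lives under one
// of the mounts' host paths, and reports whether any mount matched.
func resolveInMount(hostCmd string, mounts []mount) (string, bool) {
	hostCmd = filepath.Clean(hostCmd)
	for _, m := range mounts {
		rel, err := filepath.Rel(filepath.Clean(m.HostPath), hostCmd)
		if err != nil || rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
			continue // the command is outside this mount
		}
		return filepath.Join(m.TaskPath, rel), true
	}
	return "", false
}

func main() {
	mounts := []mount{{HostPath: "/opt/tools", TaskPath: "/tools"}}
	for _, cmd := range []string{"/opt/tools/bin/run", "/opt/tools/../secret/run"} {
		p, ok := resolveInMount(cmd, mounts)
		fmt.Printf("%s -> %q (allowed: %v)\n", cmd, p, ok)
	}
}
```

Cleaning the path before computing the relative path means a command that uses `..` to climb out of the mount is rejected rather than rewritten.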
2022-11-11 09:51:15 -05:00
Seth Hoenig
106dce9c9f docs: clarify how to access task meta values in templates (#15212)
This PR updates template and meta docs pages to give examples of accessing
meta values in templates. To do so one must use the environment variable form
of the meta key name, which isn't obvious and wasn't yet documented.
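
To illustrate the idea, here is a small Go sketch that renders a template through an `env` helper. It assumes the meta key `owner` is exposed to the task as `NOMAD_META_owner`; the `render` function and the map-backed `env` helper are hypothetical stand-ins for the template runtime, not Nomad code.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// render executes tmpl with an `env` helper backed by the given map rather
// than the real process environment.
func render(tmpl string, env map[string]string) (string, error) {
	funcs := template.FuncMap{
		"env": func(name string) string { return env[name] },
	}
	t, err := template.New("example").Funcs(funcs).Parse(tmpl)
	if err != nil {
		return "", err
	}
	var out strings.Builder
	if err := t.Execute(&out, nil); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	// Assumed environment-variable form of the meta key "owner".
	env := map[string]string{"NOMAD_META_owner": "platform-team"}
	s, err := render(`owner = {{ env "NOMAD_META_owner" }}`, env)
	if err != nil {
		fmt.Println("render failed:", err)
		return
	}
	fmt.Println(s)
}
```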
2022-11-10 16:11:53 -06:00
twunderlich-grapl
1b5eedc07a Fix s3 example URLs in the artifacts docs (#15123)
* Fix s3 URLs so that they work

Unfortunately, s3 urls prefixed with https:// do NOT work with the underlying go-getter library. As such, this fixes the examples so that they are working examples that won't cause problems for people reading the docs.
See discussion in https://github.com/hashicorp/nomad/issues/1113 circa 2016.

* Use s3:// protocol schema for artifact examples

Per the discussion in https://github.com/hashicorp/nomad/pull/15123,
we're going to use the explicit s3 protocol in the examples since that
is the likeliest to work in all scenarios
2022-11-07 14:14:57 -05:00
Tim Gross
ce0e0768ff API for Eval.Count (#15147)
Add a new `Eval.Count` RPC and associated HTTP API endpoints. This API is
designed to support interactive use in the `nomad eval delete` command to get a
count of evals expected to be deleted before doing so.

The state store operations to do this sort of thing are somewhat expensive, but
it's cheaper than serializing a big list of evals to JSON. Note that although it
seems like this could be done as an extra parameter and response field on
`Eval.List`, having it as its own endpoint avoids having to change the response
body shape and lets us avoid handling the legacy filter params supported by
`Eval.List`.
2022-11-07 08:53:19 -05:00
Charlie Voiselle
52a254ba22 template: error on missing key (#15141)
* Support error_on_missing_value for templates
* Update docs for template stanza
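
For readers unfamiliar with the behavior, Go's standard `text/template` package has a comparable switch, `missingkey=error`. The sketch below shows only that standard-library option; how Nomad wires its own template setting through is not shown here, and the `renderStrict` helper is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// renderStrict renders tmpl against data. When errorOnMissing is true, a
// reference to a key that is absent from data fails instead of rendering a
// placeholder.
func renderStrict(tmpl string, data map[string]string, errorOnMissing bool) (string, error) {
	t := template.New("example")
	if errorOnMissing {
		t = t.Option("missingkey=error")
	}
	t, err := t.Parse(tmpl)
	if err != nil {
		return "", err
	}
	var out strings.Builder
	if err := t.Execute(&out, data); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	data := map[string]string{"region": "global"}
	for _, strict := range []bool{false, true} {
		out, err := renderStrict("dc = {{ .datacenter }}", data, strict)
		fmt.Printf("strict=%v output=%q err=%v\n", strict, out, err)
	}
}
```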
2022-11-04 13:23:01 -04:00
Phil Renaud
85f472189a Accidentally trailed off on a docs paragraph (#15118) 2022-11-02 23:33:41 -04:00
Phil Renaud
1a29e72f7f [ui] Adds meta to job list stub and displays a pack logo on the jobs index (#14833)
* Adds meta to job list stub and displays a pack logo on the jobs index

* Changelog

* Modifying struct for optional meta param

* Explicitly ask for meta anytime I look up a job from index or job page

* Test case for the endpoint

* adding meta field to API struct and omitting from response if empty

* passthru method added to api/jobs.list

* Meta param listed in docs for jobs list

* Update api/jobs.go

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2022-11-02 16:58:24 -04:00
Tim Gross
6b2da83f6a keyring: safely handle missing keys and restore GC (#15092)
When replication of a single key fails, the replication loop breaks early and
therefore keys that fall later in the sorting order will never get
replicated. This is particularly a problem for clusters impacted by the bug that
caused #14981 and that were later upgraded; the keys that were never replicated
can now never be replicated, and so we need to handle them safely.

Included in the replication fix:
* Refactor the replication loop so that each key is replicated in a function call
  that returns an error, to make the workflow more clear and reduce nesting. Log
  the error and continue.
* Improve stability of keyring replication tests. We no longer block leadership
  on initializing the keyring, so there's a race condition in the keyring tests
  where we can test for the existence of the root key before the keyring has
  been initialized. Change this to an "eventually" test.

But these fixes aren't enough to fix #14981 because they'll end up seeing an
error once a second complaining about the missing key, so we also need to fix
keyring GC so the keys can be removed from the state store. Now we'll store the
key ID used to sign a workload identity in the Allocation, and we'll index the
Allocation table on that so we can track whether any live Allocation was signed
with a particular key ID.
2022-11-01 15:00:50 -04:00
Tim Gross
b363c56c96 docs: improved documentation on hardening and required capabilities (#15036)
The existing docs on required capabilities are a little sparse and have been the
subject of a lot of questions. Expand on this information and provide a pointer
to the ongoing design discussion around rootless Nomad.
2022-10-26 09:46:13 -04:00
Tim Gross
b583f7822a keyring: remove root key GC (#15034) 2022-10-25 17:06:18 -04:00
Zach Shilton
563e5e3d57 docs: add details to redirects file (#15020) 2022-10-24 13:16:07 -04:00
Luiz Aoqui
f2318ed2ec docs: use of node_class when autoscaling (#14950)
Document how the value of `node_class` is used during cluster scaling.

https://github.com/hashicorp/nomad-autoscaler/issues/255
2022-10-21 10:35:45 -04:00
James Rasell
1c9b4e398d acl: add ACL roles to event stream topic and resolve policies. (#14923)
This change adds ACL role creation and deletion to the event
stream. It is exposed as a single topic with two types; the filter
is primarily the role ID but also includes the role name.

While conducting this work it was also discovered that the events
stream has its own ACL resolution logic. This did not account for
ACL tokens which included role links, or tokens with expiry times.
ACL role links are now resolved to their policies and tokens are
checked for expiry correctly.
2022-10-20 09:43:35 +02:00
James Rasell
eaea9164a5 acl: correctly resolve ACL roles within client cache. (#14922)
The client ACL cache was not accounting for tokens which included
ACL role links. This change modifies the behaviour to resolve role
links to policies. It will also now store ACL roles within the
cache for quick lookup. The cache TTL is configurable in the same
manner as policies or tokens.

Another small fix is included that takes into account the ACL
token expiry time. This was not included, which meant tokens with
expiry could be used past the expiry time, until they were GC'd.
2022-10-20 09:37:32 +02:00
Luiz Aoqui
56816f2f93 docs: expand Autoscaling documentation (#14937)
Rename `Internals` section to `Concepts` to match core docs structure
and expand on how policies are evaluated.

Also include missing documentation for check grouping and fix examples
to use the new feature.
2022-10-19 17:57:08 -04:00
Luiz Aoqui
3fd800c600 docs: add autoscaling debug (#14941) 2022-10-19 14:17:41 -04:00
Luiz Aoqui
38606a6a5b docs: move autoscaling source agent config (#14947)
Move the Autoscaler agent configuration `source` to the `policy` page
since they are very closely related.

Also update all headers in this section so they follow the proper `h1 >
h2 > h3 > ...` hierarchy.
2022-10-19 14:17:09 -04:00
Luiz Aoqui
876ea90075 docs: explain autoscaler target-value strategy (#14951)
Provide more technical details about how the `target-value` strategy
calculates new scaling actions.
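
As an illustration of the kind of calculation involved, the sketch below assumes a simple proportional rule: scale the current count by the ratio of the observed metric to the target, round up, and skip changes when the ratio is within a small threshold of 1. The rule, the `nextCount` helper, and the zero-target handling are assumptions made for this example, not details taken from the commit.

```go
package main

import (
	"fmt"
	"math"
)

// nextCount returns a new count under the assumed proportional rule.
// threshold is the tolerance around a factor of 1 within which the count
// is left unchanged.
func nextCount(current int64, metric, target, threshold float64) int64 {
	if target == 0 {
		return current // hypothetical handling: avoid dividing by zero
	}
	factor := metric / target
	if math.Abs(factor-1) <= threshold {
		return current
	}
	return int64(math.Ceil(float64(current) * factor))
}

func main() {
	fmt.Println(nextCount(4, 90, 60, 0.01))   // metric above target: scale out
	fmt.Println(nextCount(4, 30, 60, 0.01))   // metric below target: scale in
	fmt.Println(nextCount(4, 60.3, 60, 0.01)) // within threshold: no change
}
```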
2022-10-19 14:16:17 -04:00
Zach Shilton
c81fe3cf40 website: fix broken links (#14946)
* fix: nomad license put link

* fix: redirected URL

* fix: avoid auto-formatting changes
2022-10-19 14:07:48 -04:00
Anthony
6dcf008fbb Updated datacenter block description (#14953)
* Updated datacenter block description

* Replacing accidentally removed title

* docs: add closing period

Co-authored-by: Seth Hoenig <shoenig@duck.com>
2022-10-19 08:44:52 -05:00
HashiBot
bf279ac019 chore: Update Digital Team Files (#14945)
* Update generated scripts (website-start.sh)

* Update generated scripts (should-build.sh)

* Update generated scripts (website-build.sh)

* Update generated website Makefile
2022-10-18 17:43:31 -04:00
HashiBot
c9bd653815 chore: Update Digital Team Files (#14940)
* Update generated scripts (should-build.sh)

* Update generated scripts (website-build.sh)

* Update generated scripts (website-start.sh)

* Update generated website Makefile
2022-10-18 12:36:24 -04:00
Zach Shilton
cc2b449911 website: redirects to empty array (#14921) 2022-10-18 11:57:36 -04:00
Bryce Kalow
f49b3a95dd website: fixes redirected links (#14918) 2022-10-18 10:31:52 -05:00
Kevin Wang
57dc7c2ab1 fix: website broken links (#14904)
* fix: website broken links

* fix up keyring-rotate link

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2022-10-17 11:32:10 -04:00
Seth Hoenig
9e7e5e081e services: remove assertion on 'task' field being set (#14864)
This PR removes the assertion around when the 'task' field of
a check may be set. Starting in Nomad 1.4 we automatically set
the task field on all checks in support of the NSD checks feature.

This is causing validation problems elsewhere, e.g. when a group
service using the Consul provider sets 'task' it will fail
validation that worked previously.

The assertion of leaving 'task' unset was only about making sure
job submitters weren't expecting some behavior, but in practice
is causing bugs now that we need the task field for more than it
was originally added for.

We can simply update the docs, noting when the task field set by
job submitters actually has value.
2022-10-10 13:02:33 -05:00
Damian Czaja
e4efedbbe4 cli: add nomad fmt (#14779) 2022-10-06 17:00:29 -04:00
Giovani Avelar
2b9158b73e Allow specification of a custom job name/prefix for parameterized jobs (#14631) 2022-10-06 16:21:40 -04:00
Michael Schurter
0779a5bc10 docs: clarify nomad vars vs vault (#14831)
* docs: clarify nomad vars vs vault

I think we should make the difference in root key management between
Nomad and Vault clear in the concept docs. I didn't see anywhere else in
the docs we compared it.

I also s/secrets/variables everywhere except the first sentence since
the feature is intended to be more generic than secrets. Right now it's
more of a complement to Consul's kv than Vault due to root key handling
and featureset.

* Update website/content/docs/concepts/variables.mdx

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2022-10-06 13:17:26 -07:00
HashiBot
873d4f33c8 website: upgrade next version (#14830)
Co-authored-by: Bryce Kalow <bkalow@hashicorp.com>
2022-10-06 13:48:11 -05:00
Tim Gross
f70fcf659e docs: 1.4.0 upgrade warning for keyring initialization (#14825) 2022-10-06 11:32:35 -04:00
Elijah Voigt
5fdcbf085f Docs(job-specification/periodic): Add enabled toggle (#14767)
This is probably undocumented for a reason, but the `enabled` toggle in the
`periodic` stanza is very useful so I figured I'd try adding it to the docs.

The feature has been secretly available since #9142 and was called out in that
PR as being a dubious addition, only added to avoid regressions.

The use case for disabling a periodic job in this way is to prevent it from
running without modifying the schedule. Ideally Nomad would make it more clear
that this was the case, and allow you to force a run of the job, but even with
those rough edges I think users would benefit from knowing about this toggle.
2022-10-03 15:08:24 -04:00
Tim Gross
98deb8d8a0 internals documentation with diagrams (#14750)
This changeset adds new architecture internals documents to the contributing
guide. These are intentionally here and not on the public-facing website because
the material is not required for operators and includes a lot of diagrams that
we can cheaply maintain with mermaid syntax but would involve art assets to have
up on the main site that would become quickly out of date as code changes happen
and be extremely expensive to maintain. However, these should be suitable to use
as points of conversation with expert end users.

Included:
* A description of Evaluation triggers and expected counts, with examples.
* A description of Evaluation states and implicit states. This is taken from an
  internal document in our team wiki.
* A description of how writing the State Store works. This is taken from a
  diagram I put together a few months ago for internal education purposes.
* A description of Evaluation lifecycle, from registration to running
  Allocations. This is mostly lifted from @lgfa29's amazing mega-diagram, but
  broken into digestible chunks and without multi-region deployments, which I'd
  like to cover in a future doc.

Also includes adding Deployments to our public-facing glossary.

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
Co-authored-by: Seth Hoenig <shoenig@duck.com>
2022-10-03 14:06:41 -04:00
dependabot[bot]
6aee370969 build(deps-dev): bump @hashicorp/platform-cli in /website (#14541)
Bumps [@hashicorp/platform-cli](https://github.com/hashicorp/web-platform-packages/tree/HEAD/packages/cli) from 2.1.0 to 2.3.0.
- [Release notes](https://github.com/hashicorp/web-platform-packages/releases)
- [Changelog](https://github.com/hashicorp/web-platform-packages/blob/main/packages/cli/CHANGELOG.md)
- [Commits](https://github.com/hashicorp/web-platform-packages/commits/@hashicorp/platform-cli@2.3.0/packages/cli)

---
updated-dependencies:
- dependency-name: "@hashicorp/platform-cli"
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-09-30 14:59:55 -04:00
Tim Gross
fb1f5ea2d9 Revert removing deprecated client options docs (#14753)
This reverts PR #12416 and commit 6668ce022a.

While the driver options are well and truly deprecated, this documentation also
covers features like `fingerprint.denylist` that are not available any other
way. Let's revert this until #12420 is ready.
2022-09-30 08:38:03 -04:00
Derek Strickland
58e76c64d5 Merge pull request #14664 from hashicorp/docs-multiregion-dispatch
multiregion: Added a section for multiregion parameterized job dispatch
2022-09-28 15:40:11 -04:00
Derek Strickland
3c63967107 link from dispatch command 2022-09-28 08:30:22 -04:00
Derek Strickland
2c1df34fee Apply suggestions from code review 2022-09-28 08:18:56 -04:00
Derek Strickland
998f662ecd Update website/content/docs/job-specification/multiregion.mdx
Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2022-09-28 07:20:11 -04:00
Derek Strickland
6ac87c396f Update website/content/docs/job-specification/multiregion.mdx
Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2022-09-28 07:19:54 -04:00
Seth Hoenig
1e5f6188fb core: numeric operands comparisons in constraints (#14722)
* cleanup: fixup linter warnings in scheduler/feasible.go

* core: numeric operands comparisons in constraints

This PR changes constraint comparisons to be numeric rather than
lexical if both operands are integers or floats.

Inspiration #4856
Closes #4729
Closes #14719

* fix: always parse as int64
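
The difference between lexical and numeric comparison can be seen in a short sketch. The `lessThan` helper below is hypothetical: it compares as `int64` when both operands parse as integers, as `float64` when both parse as floats, and falls back to string comparison otherwise.

```go
package main

import (
	"fmt"
	"strconv"
)

// lessThan compares two constraint operands numerically when both are
// numbers, and lexically otherwise.
func lessThan(l, r string) bool {
	if li, err := strconv.ParseInt(l, 10, 64); err == nil {
		if ri, err := strconv.ParseInt(r, 10, 64); err == nil {
			return li < ri
		}
	}
	if lf, err := strconv.ParseFloat(l, 64); err == nil {
		if rf, err := strconv.ParseFloat(r, 64); err == nil {
			return lf < rf
		}
	}
	return l < r // fall back to lexical ordering
}

func main() {
	fmt.Println("lexical 9 < 10:", "9" < "10")
	fmt.Println("numeric 9 < 10:", lessThan("9", "10"))
	fmt.Println("mixed 1.5 < 10:", lessThan("1.5", "10"))
	fmt.Println("strings abc < abd:", lessThan("abc", "abd"))
}
```

With plain string comparison, `"9"` sorts after `"10"` because `'9'` is greater than `'1'`, which is the surprise the numeric path avoids.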
2022-09-27 11:07:07 -05:00
Michael Schurter
a6dc5ea585 docs: write a lot of words about heartbeats (#14679)
* docs: write a lot of words about heartbeats

Alternative to #14670

* Apply suggestions from code review

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* use descriptive title for link

* rework example of high failover ttl

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2022-09-26 14:43:34 -07:00
Michael Schurter
2e059c624f fingerprint: add node attr for reservable cores (#14694)
* fingerprint: add node attr for reservable cores

Add an attribute for the number of reservable CPU cores as they may
differ from the existing `cpu.numcores` due to client configuration or
OS support.

Hopefully clarifies some confusion in #14676

* add changelog

* num_reservable_cores -> reservablecores
2022-09-26 13:03:03 -07:00