24949 Commits

Author SHA1 Message Date
Tim Gross
acfb4e679a docs: expand pprof documentation on goroutine profiles (#18172) 2023-08-08 08:33:42 -04:00
Devashish Taneja
472693d642 server: add config to tune job versions retention. #17635 (#17939) 2023-08-07 14:47:40 -04:00
Tim Gross
5d2c1d1f03 test: fix flaky RPC TLS enforcement test (#18155)
The RPC TLS enforcement test creates network connections to a server and these
are occasionally failing in testing with `write: broken pipe` errors. This has
been an ongoing issue where it'll appear to get fixed, then reoccur, and no one
seems to be able to reproduce outside of CI. The test assertion itself is
reliable, which is why it's been hard to spend effort to hunt this down.

The failing test cases are ones that are never supposed to work b/c they fail
our TLS cert role validation. The error message is coming from the TLS handshake
error. The RPC connection handler closes the connection immediately on getting
the error from the TLS handshake. The stdlib's TLS library flushes the
connection's buffer before returning the error. So the theory is that in the
failing case we don't get the error message before the connection is closed, but
do get the error return that allows the client to move on to a write, which
tries to write on the closed pipe.

I've been unable to reproduce this exactly, as the race is effectively between
the OS and the runtime. The equivalent test of the Raft TLS enforcement includes
handling of an EOF instead of the certificate error, so it appears this is actually
expected (or at least known) behavior. Because the code under test is operating
as expected, this changeset updates the assertion to accept the error.
2023-08-07 11:17:06 -04:00
Abbas Yazdanpanah
388198abef CLI: make snapshot name required in creating volume snapshots (#17958)
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
2023-08-04 10:36:07 +01:00
Tim Gross
902f640c80 docs: fix URL in agent pprof examples (#18142) 2023-08-03 16:05:53 -04:00
dependabot[bot]
9551441dff build(deps): bump github.com/hashicorp/go-kms-wrapping/v2 (#17957)
Bumps [github.com/hashicorp/go-kms-wrapping/v2](https://github.com/hashicorp/go-kms-wrapping) from 2.0.8 to 2.0.12.
- [Commits](https://github.com/hashicorp/go-kms-wrapping/compare/v2.0.8...v2.0.12)

---
updated-dependencies:
- dependency-name: github.com/hashicorp/go-kms-wrapping/v2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-08-03 15:43:14 -04:00
dependabot[bot]
02b572473b build(deps): bump github.com/opencontainers/runc from 1.1.5 to 1.1.8 (#18037)
Bumps [github.com/opencontainers/runc](https://github.com/opencontainers/runc) from 1.1.5 to 1.1.8.
- [Release notes](https://github.com/opencontainers/runc/releases)
- [Changelog](https://github.com/opencontainers/runc/blob/v1.1.8/CHANGELOG.md)
- [Commits](https://github.com/opencontainers/runc/compare/v1.1.5...v1.1.8)

---
updated-dependencies:
- dependency-name: github.com/opencontainers/runc
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-08-03 15:37:04 -04:00
dependabot[bot]
0d3f976a8a build(deps): bump github.com/hashicorp/consul/api from 1.18.0 to 1.23.0 (#18038)
Bumps [github.com/hashicorp/consul/api](https://github.com/hashicorp/consul) from 1.18.0 to 1.23.0.
- [Release notes](https://github.com/hashicorp/consul/releases)
- [Changelog](https://github.com/hashicorp/consul/blob/main/CHANGELOG.md)
- [Commits](https://github.com/hashicorp/consul/compare/api/v1.18.0...api/v1.23.0)

---
updated-dependencies:
- dependency-name: github.com/hashicorp/consul/api
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-08-03 15:01:34 -04:00
Tim Gross
b1742c7015 scheduler: filter device instance IDs by constraints (#18141)
When the scheduler assigns a device instance, it iterates over the feasible
devices and then picks the first instance with availability. If the jobspec uses
a constraint on device ID, this can lead to buggy/surprising behavior where the
node's device matches the constraint but then the individual device instance
does not.

Add a second filter based on the `${device.ids}` constraint after selecting a
node's device to ensure the device instance ID falls within the constraint as
well.

Fixes: #18112
2023-08-03 14:58:30 -04:00
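The second filter described in the commit above can be sketched as follows. The function name and the map-based representation of the `${device.ids}` constraint are assumptions made for illustration, not the scheduler's actual types.

```go
package main

// pickInstance returns the first instance ID that is free and that also
// satisfies the ID constraint. A nil allowed set means the jobspec has no
// constraint on device IDs. Hypothetical helper for illustration only.
func pickInstance(instanceIDs []string, inUse, allowed map[string]bool) (string, bool) {
	for _, id := range instanceIDs {
		if inUse[id] {
			continue // instance already assigned to another allocation
		}
		if allowed != nil && !allowed[id] {
			continue // device matched the node-level check, but this instance does not
		}
		return id, true
	}
	return "", false
}
```

Without the `allowed` check, the first free instance would be returned even when only a later instance matches the constraint, which is the surprising behavior the commit describes.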
James Rasell
9707aafc5b test: add tests for allocNameIndex core funcs (#18136) 2023-08-03 15:43:50 +01:00
Karuppiah Natarajan
2fd508d4f1 docs: fix link for stopping an agent (#18130) 2023-08-02 11:51:45 -04:00
Tim Gross
8ad663d1de allocwatcher: don't destroy local allocdir after migration (#18108)
When ephemeral disks are migrated from an allocation on the same node,
allocation logs for the previous allocation are lost.

There are two workflows for the best-effort attempt to migrate the allocation
data between the old and new allocations. For previous allocations on other
clients (the "remote" workflow), we create a local allocdir and download the
data from the previous client into it. That data is then moved into the new
allocdir and we delete the allocdir of the previous alloc.

For "local" previous allocations we don't need to create an extra directory for
the previous allocation and instead move the files directly from one to the
other. But we still delete the old allocdir _entirely_, which includes all the
logs!

There doesn't seem to be any reason to destroy the local previous allocdir, as
the usual client garbage collection should destroy it later on when needed. By
not deleting it, the previous allocation's logs are still available for the user
to read.

Fixes: #18034
2023-08-02 09:41:46 -04:00
Charlie Voiselle
585b0533c0 [dep] bump golang.org/x/exp (#18102)
There are some refactorings that have to be made in the getter and state
where the api changed in `slices`

* Bump golang.org/x/exp
* Bump golang.org/x/exp in api
* Update job_endpoint_test
* [feedback] unexport sort function
2023-08-01 11:50:17 -04:00
Luiz Aoqui
768978883d cli: search all namespaces for node volumes (#17925)
When looking for CSI volumes to display in the `node status` command the
CLI needs to search all namespaces.
2023-08-01 09:55:39 -04:00
Kevin Schoonover
4841791c86 fingerprint: fix 'default' alias not added to interface specified by network_interface (#18096) 2023-08-01 08:35:31 -04:00
dependabot[bot]
511cb55633 build(deps): bump word-wrap from 1.2.3 to 1.2.4 in /ui (#17972)
Bumps [word-wrap](https://github.com/jonschlinkert/word-wrap) from 1.2.3 to 1.2.4.
- [Release notes](https://github.com/jonschlinkert/word-wrap/releases)
- [Commits](https://github.com/jonschlinkert/word-wrap/compare/1.2.3...1.2.4)

---
updated-dependencies:
- dependency-name: word-wrap
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-07-31 15:57:22 -04:00
Phil Renaud
18dd9e722f [ui] Job Variables page (#17964)
* Bones of a component that has job variable awareness

* Got vars listed woo

* Variables as its own subnav and some pathLinkedVariable perf fixes

* Automatic Access to Variables alerter

* Helper and component to conditionally render the right link

* A bit of cleanup post-template stuff

* testfix for looping right-arrow keynav bc we have a new subnav section

* A very roundabout way of ensuring that, if a job exists when saving a variable with a pathLinkedEntity of that job, it's saved right through to the job itself

* hacky but an async version of pathLinkedVariable

* model-driven and async fetcher driven with cleanup

* Only run the update-job func if jobname is detected in var path

* Test cases begun

* Management token for variables to appear in tests

* It's a management token so it gets to see the clients tab under system jobs

* Pre-review cleanup

* More tests

* Number of requests test and small fix to groups-by-way-or-resource-arrays elsewhere

* Variable intro text tests

* Variable name re-use

* Simplifying our wording a bit

* parse json vs plainId

* Addressed PR feedback, including de-waterfalling
2023-07-31 15:04:36 -04:00
Tim Gross
4fb5bf9a16 cli: support wildcard namespace in alloc subcommands (#18095)
The alloc exec and filesystem/logs commands allow passing the `-job` flag to
select a random allocation. If the namespace for the command is set to `*`, the
RPC handler doesn't handle this correctly as it's expecting to query for a
specific job. Most commands handle this ambiguity by first verifying that only a
single object of the type in question exists (ex. a single node or job).

Update these commands so that when the `-job` flag is set we first verify
there's a single job that matches. This also allows us to extend the
functionality to allow for the `-job` flag to support prefix matching.

Fixes: #12097
2023-07-31 13:15:15 -04:00
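The "verify there's a single job that matches" step from the commit above might be sketched like this. The `jobStub` type and `resolveJob` function are hypothetical, and preferring a unique exact match over longer IDs that share the prefix is an assumption rather than something the commit message states.

```go
package main

import (
	"fmt"
	"strings"
)

// jobStub is a hypothetical, minimal view of a job returned by a listing
// across all namespaces.
type jobStub struct {
	Namespace string
	ID        string
}

// resolveJob finds the single job that the -job argument refers to.
func resolveJob(jobs []jobStub, query string) (jobStub, error) {
	var exact, prefixed []jobStub
	for _, j := range jobs {
		if j.ID == query {
			exact = append(exact, j)
		}
		if strings.HasPrefix(j.ID, query) {
			prefixed = append(prefixed, j)
		}
	}
	// Assumption: a unique exact match wins over longer IDs sharing the prefix.
	if len(exact) == 1 {
		return exact[0], nil
	}
	switch len(prefixed) {
	case 0:
		return jobStub{}, fmt.Errorf("no job matches %q", query)
	case 1:
		return prefixed[0], nil
	default:
		return jobStub{}, fmt.Errorf("%d jobs match %q; use a longer prefix or a specific namespace", len(prefixed), query)
	}
}
```

Once a single job is resolved, its concrete namespace can be used for the follow-up request instead of the `*` wildcard.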
Phil Renaud
66649d12a7 [ui] Search results are overloading filter with sorted results (#18053)
* Attempt at a varied end-result when sorting and searching

* Consider sort direction as well

* computed property dep update

* prioritizeSearchOrder and test

* Side-effecty but resets sort on search etc

* changelog
2023-07-31 13:07:27 -04:00
Tim Gross
1ef8ad8176 scheduler: fix panic in render_templates destructive update check (#18100)
In #18054 we introduced a new field `render_templates` in the `restart`
block. Previously changes to the `restart` block were always non-destructive in
the scheduler but we now need to check the new field so that we can update the
template runner. The check assumed that the block was always non-nil, which
causes panics in our scheduler tests.
2023-07-31 11:52:51 -04:00
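A nil-safe version of the check described in the commit above could be sketched as below. The `restartPolicy` type is a hypothetical stand-in for the real `restart` block struct, and treating a missing block as `render_templates = false` is an assumption.

```go
package main

// restartPolicy is a hypothetical stand-in for the jobspec restart block.
type restartPolicy struct {
	RenderTemplates bool
}

// renderTemplatesEnabled treats a missing restart block as "false"
// instead of dereferencing a nil pointer.
func renderTemplatesEnabled(p *restartPolicy) bool {
	return p != nil && p.RenderTemplates
}

// renderTemplatesChanged reports whether the field differs between two
// versions of a task, tolerating a nil block on either side.
func renderTemplatesChanged(a, b *restartPolicy) bool {
	return renderTemplatesEnabled(a) != renderTemplatesEnabled(b)
}
```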
Gunnar
76ebb3fe55 docs: added accessor info to Tuples in template.mdx (#18101) 2023-07-31 11:03:12 -04:00
Gerard Nguyen
9e98d694a6 feature: Add new field render_templates on restart block (#18054)
This feature is necessary when a user wants to explicitly re-render all templates on task restart,
e.g. to fetch all new secrets from Vault, even if the lease on the existing secrets has not expired.
2023-07-28 11:53:32 -07:00
Tim Gross
b17c0f7ff9 GHA pinning updates (#18093)
Trusted Supply Chain Component Registry (TSCCR) enforcement starts Monday and an
internal report shows our semgrep action is pinned to a version that's not
currently permitted. Update all the action versions to whatever's the new
hotness to maximize the time-to-live on these until we have automated pinning
set up.

Also version bumps our chromedriver action, which randomly broke upstream today.
2023-07-28 11:49:57 -04:00
Luiz Aoqui
ee31916c3b cli: add help message for -consul-namespace (#18081)
Add missing help entry for the `-consul-namespace` flag in `nomad job
run`.
2023-07-28 10:22:59 -04:00
James Rasell
0a32d7ff5b docs: add allocation checks API documentation. (#18078) 2023-07-28 08:49:14 +01:00
Michael Schurter
d14362ec19 core: add jwks rpc and http api (#18035)
Add JWKS endpoint to HTTP API for exposing the root public signing keys used for signing workload identity JWTs.

Part 1 of N components as part of making workload identities consumable by third party services such as Consul and Vault. Identity attenuation (audience) and expiration (+renewal) are necessary to securely use workload identities with 3rd parties, so this merge does not yet document this endpoint.

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2023-07-27 11:27:17 -07:00
Piotr Kazmierczak
ee0b104785 build: support s390x architecture for linux (ent) (#18069)
Makefile changes required for supporting s390x builds and a corresponding
changelog entry.
2023-07-26 17:43:37 +02:00
Piotr Kazmierczak
0a5667c0c7 changelog entry for nomad-enterprise#1201 (#18071) 2023-07-26 16:48:15 +02:00
Ville Vesilehto
2c463bb038 chore(lint): use Go stdlib variables for HTTP methods and status codes (#17968) 2023-07-26 15:28:09 +01:00
Ville Vesilehto
5c9cd35055 chore(variable): Go stdlib vars for HTTP methods and status codes (#18062) 2023-07-26 14:30:11 +01:00
Ville Vesilehto
a8fd803176 chore(nodepool): Go stdlib vars for HTTP methods and status codes (#18061) 2023-07-26 14:23:28 +01:00
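The three lint commits above replace string and integer literals with the named constants from Go's `net/http` package. A minimal sketch of the pattern follows; the helper names and the example URL are illustrative only.

```go
package main

import "net/http"

// newListRequest builds a GET request using http.MethodGet rather than
// the string literal "GET".
func newListRequest(url string) (*http.Request, error) {
	return http.NewRequest(http.MethodGet, url, nil)
}

// isSuccess compares against http.StatusOK rather than the literal 200.
func isSuccess(code int) bool {
	return code == http.StatusOK
}
```

For example, `newListRequest("http://127.0.0.1:4646/v1/jobs")` yields a request whose `Method` field is `"GET"`.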
James Rasell
7f30444356 changelog: add entry for #18044 (#18056) 2023-07-25 13:04:19 +01:00
Phil Renaud
937d927af7 Default-sort variable keyvalues at serialization (#18051) 2023-07-24 14:25:29 -04:00
Luiz Aoqui
55723e5a3b website: add Nomad Ops to Tools (#18006) 2023-07-24 11:32:54 -04:00
James Rasell
738bdb213d build: update to go1.20.6 (#18044) 2023-07-24 16:13:22 +01:00
Bruce Lok
7173d3bc25 Add missing consul grpc config (#17943) 2023-07-24 12:39:23 +01:00
Lance Haig
03cde51720 Rename Function to reflect correct outcome. (#17948) 2023-07-24 10:43:51 +01:00
Kevin Mulvey
ea37488e54 check if stderrFrame is nil before logging stderrFrame.Data (#17815) 2023-07-24 09:33:14 +01:00
James Rasell
2a91bf4469 node-pool: fix validate name function comment typo. (#17927) 2023-07-24 08:28:05 +01:00
stswidwinski
b9a388f5df Retain task states for post stop tasks at the time of node GC (#18005)
* Retain task states for post stop tasks at the time of node GC
2023-07-21 10:55:00 -07:00
Tim Gross
4768c2a455 Merge pull request #18028 from hashicorp/post-1.6.1-release
Post 1.6.1 release
2023-07-21 11:31:34 -04:00
hc-github-team-nomad-core
0bcc20e9e5 Prepare for next release 2023-07-21 11:12:00 -04:00
hc-github-team-nomad-core
583f8773fa Generate files for 1.6.1 release 2023-07-21 11:09:15 -04:00
Phil Renaud
91e1bafbac Changelog entry for remote purge boot-out (#18026) 2023-07-21 09:21:02 -04:00
Luiz Aoqui
2b3dd86dc5 ui: handle node pool requests to older regions (#18021)
When accessing a region running a version of Nomad without node pools an
error was thrown because the request is handled by the nodes endpoint
which fails because it assumes `pools` is the node ID.
2023-07-21 09:16:49 -04:00
Luiz Aoqui
5d3639f304 ui: handle errors from unimplemented services (#18020)
When a request is made to an RPC service that doesn't exist (for
example, a cross-region request from a newer version of Nomad to an
older version that doesn't implement the endpoint) the application
should return an empty list as well.
2023-07-21 09:16:35 -04:00
Luiz Aoqui
f8b9b5c387 state: canonicalize namespace on restore (#18017)
The upgrade path to Nomad 1.6.0 requires canonicalizing the namespace in
order to set the default scheduler configuration values.

The previous implementation only canonicalized on namespace upsert
operations, which works for recent namespaces as those Raft transactions
are reapplied on upgrade.

But for older namespaces restored from a snapshot the code path did not
canonicalize them, leaving the scheduler configuration set as `nil`.
2023-07-20 16:04:51 -04:00
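The fix described above amounts to running the same canonicalization on the snapshot restore path as on upsert. Here is a minimal sketch under stated assumptions: every type, field, and the `"binpack"` default below is a hypothetical placeholder, not Nomad's actual state store code.

```go
package main

// namespaceDefaults is a hypothetical placeholder for the per-namespace
// configuration that must not be left nil after an upgrade.
type namespaceDefaults struct {
	SchedulerAlgorithm string
}

type namespace struct {
	Name     string
	Defaults *namespaceDefaults
}

// canonicalize fills in defaults that older versions never wrote.
func (n *namespace) canonicalize() {
	if n.Defaults == nil {
		// Placeholder default value, chosen for illustration only.
		n.Defaults = &namespaceDefaults{SchedulerAlgorithm: "binpack"}
	}
}

type stateStore struct {
	namespaces map[string]*namespace
}

// upsert is the path that already canonicalized before the fix.
func (s *stateStore) upsert(ns *namespace) {
	ns.canonicalize()
	s.namespaces[ns.Name] = ns
}

// restore is the snapshot path; the fix is to canonicalize here as well.
func (s *stateStore) restore(ns *namespace) {
	ns.canonicalize()
	s.namespaces[ns.Name] = ns
}
```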
Tim Gross
f52912454d CSI: improve controller RPC reliability (#17996)
The CSI specification says that we "SHOULD" send no more than one in-flight
request per *volume* at a time, with an allowance for losing state
(ex. leadership transitions) which the plugins "SHOULD" handle gracefully. We
mostly successfully serialize node and controller RPCs for the same volume,
except when Nomad clients are lost. (See also
https://github.com/container-storage-interface/spec/issues/512)

These concurrency requirements in the spec fall short because Storage Provider
APIs aren't necessarily safe to call concurrently on the same host even for
_different_ volumes. For example, concurrently attaching AWS EBS volumes to an
EC2 instance results in a race for device names, which results in failure to
attach (because the device name is taken already and the API call fails) and
confused results when releasing claims. So in practice many CSI plugins rely on
k8s-specific sidecars for serializing storage provider API calls globally. As a
result, we have to be much more conservative about concurrency in Nomad than the
spec allows.

This changeset includes four major changes to fix this:
* Add a serializer method to the CSI volume RPC handler. When the RPC handler
  makes a destructive CSI Controller RPC, we send the RPC through this serializer
  and only one RPC is sent at a time. Any other RPCs in flight will block.
* Ensure that requests go to the same controller plugin instance whenever
  possible by sorting by lowest client ID out of the plugin instances.
* Ensure that requests go to _healthy_ plugin instances only.
* Ensure that requests for controllers can go to a controller on any _live_
  node, not just ones eligible for scheduling (which CSI controllers don't care
  about)

Fixes: #15415
2023-07-20 14:51:51 -04:00
Phil Renaud
94112d8cfd Copy button added to variables title (#17935) 2023-07-20 14:16:33 -04:00
Phil Renaud
6bed12f693 Copy change to include the nomad/jobs all-access variable prefix (#17933) 2023-07-20 14:16:14 -04:00