Commit Graph

25773 Commits

Author SHA1 Message Date
hc-github-team-nomad-core
e1333eb9f6 Prepare for next release 2024-05-07 07:06:12 +00:00
hc-github-team-nomad-core
e1a176c120 Generate files for 1.8.0-beta.1 release 2024-05-07 07:06:07 +00:00
Piotr Kazmierczak
d68d9c27e1 Prepare release 1.8.0-beta.1 2024-05-07 09:01:15 +02:00
James Rasell
5041460043 core: do not create evaluations within batch deregister endpoint. (#20510)
The batch deregister RPC endpoint is only used by the internal
garbage collection process; it is not exposed via the HTTP API or
used anywhere else.

The GC process ensures that a job can only be removed from state
if all related evaluations and allocations are in a state that
means they can also be removed from state. This means that we do
not need to create evaluations when jobs are being deregistered
via this endpoint.
2024-05-07 07:39:13 +01:00
Phil Renaud
16479af38d Jobs Index Page: Live Updates + Pagination (#20452)
* Hook and latch on the initial index

* Serialization and restart of controller and table

* de-log

* allocBlocks reimplemented at job model level

* totalAllocs doesn't mean on jobmodel what it did in steady.js

* Hamburgers to sausages

* Hacky way to bring new jobs back around and parent job handling in list view

* Getting closer to hook/latch

* Latch from update on hook from initialize, but fickle

* Note on multiple-watch problem

* Sensible monday morning comment removal

* use of abortController to handle transition and reset events

* Next token will now update when there's an on-page shift

* Very rough anti-jostle technique

* Demoable, now to move things out of route and into controller

* Into the controller, generally

* Smarter cancellations

* Reset abortController on index models run, and system/sysbatch jobs now have an improved groupCountSum computed property

* Prev Page reverse querying

* n+1th jobs existing will trigger nextToken/pagination display

* Start of a GET/POST statuses return

* Namespace fix

* Unblock tests

* Realizing to my small horror that this skipURLModification flag may be too heavy handed

* Lintfix

* Default liveupdates localStorage setting to true

* Pagination and index rethink

* Big uncoupling of watchable and url-append stuff

* Testfixes for region, search, and keyboard

* Job row class for test purposes

* Allocations in test now contain events

* Starting on the jobs list tests in earnest

* Forbidden state de-bubbling cleanup

* Job list page size fixes

* Facet/Search/Filter jobs list tests skipped

* Maybe it's the automatic mirage logging

* Unbreak task unit test

* Pre-sort sort

* styling for jobs list pagination and general PR cleanup

* moving from Job.ActiveDeploymentID to Job.LatestDeployment.ID

* modifyIndex-based pagination (#20350)

* modifyIndex-based pagination

* modifyIndex gets its own column and pagination compacted with icons

* A generic withPagination handler for mirage

* Some live-PR changes

* Pagination and button disabled tests

* Job update handling tests for jobs index

* assertion timeout in case of long setTimeouts

* assert.timeouts down to 500ms

* de-to-do

* Clarifying comment and test descriptions

* Bugfix: resizing your browser on the new jobs index page would make the viz grow forever (#20458)

* [ui] Searching and filtering options (#20459)

* Beginnings of a search box for filter expressions

* jobSearchBox integration test

* jobs list updateFilter initial test

* Basic jobs list filtering tests

* First attempt at side-by-side facets and search with a computed filter

* Weirdly close to an iterative approach but checked isn't tracked properly

* Big rework to make filter composition and decomposition work nicely with the url

* Namespace facet dropdown added

* NodePool facet dropdown added

* hdsFacet for future testing and basic namespace filtering test

* Namespace filter existence test

* Status filtering

* Node pool/dynamic facet test

* Test patchups

* Attempt at optimize test fix

* Allocation re-load on optimize page explainer

* The Big Un-Skip

* Post-PR-review cleanup

* todo-squashing

* [ui] Handle parent/child jobs with the paginated Jobs Index route (#20493)

* First pass at a non-watchQuery version

* Parameterized jobs get child fetching and jobs index status style for parent jobs

* Completed allocs vs Running allocs in a child-job context, and fix an issue where moving from parent to parent would not reset index

* Testfix and better handling empty-child-statuses-list

* Parent/child test case

* Don't show empty allocation-status bars for parent jobs with no children

* Splits Settings into 2 sections, sign-in/profile and user settings (#20535)

* Changelog
2024-05-06 17:09:37 -04:00
Phil Renaud
890c2ce713 Remove json linting while editing variables (#20529) 2024-05-03 16:33:33 -04:00
Daniel Bennett
cf87a556b3 api: new /v1/jobs/statuses endpoint for /ui/jobs page (#20130)
introduce a new API /v1/jobs/statuses, primarily for use in the UI,
which collates info about jobs, their allocations, and latest deployment.

currently the UI gets *all* of /v1/jobs and sorts and paginates them client-side
in the browser, and its "summary" column is based on historical summary data
(which can be visually misleading, and sometimes scary when a job has failed
at some point in the not-yet-garbage-collected past).

this does pagination and filtering and such, and returns jobs sorted by ModifyIndex,
so latest-changed jobs still come first. it pulls allocs and latest deployment
straight out of current state for a more robust, holistic view of the job status.
it is less efficient per-job, due to the extra state lookups, but should be more efficient
per-page (excepting perhaps for job(s) with very-many allocs).

if a POST body is sent like `{"jobs": [{"namespace": "cool-ns", "id": "cool-job"}]}`,
then the response will be limited to that subset of jobs. the main goal here is to
prevent "jostling" the user in the UI when jobs come into and out of existence.

and if a blocking query is started with `?index=N`, then the query should only
unblock if jobs "on page" change, rather than any change to any of the state
tables being queried ("jobs", "allocs", and "deployment"), to save unnecessary
HTTP round trips.
2024-05-03 15:01:40 -05:00
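The request shape described in commit cf87a556b3 above can be sketched in Go. This is a minimal illustration, not the Nomad client API: the type and function names (`jobRef`, `statusesBody`, `buildStatusesBody`, `statusesURL`) are hypothetical, and the base address in `main` is an assumed local agent address. Only the `/v1/jobs/statuses` path, the `{"jobs": [...]}` body, and the `?index=N` parameter come from the commit message.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
	"strconv"
)

// jobRef identifies one job in the POST body, matching the
// {"namespace": ..., "id": ...} shape quoted in the commit message.
// The Go type name is hypothetical.
type jobRef struct {
	Namespace string `json:"namespace"`
	ID        string `json:"id"`
}

// statusesBody is the POST body: {"jobs": [...]}.
type statusesBody struct {
	Jobs []jobRef `json:"jobs"`
}

// buildStatusesBody encodes the subset of jobs the caller wants back.
func buildStatusesBody(jobs []jobRef) (string, error) {
	b, err := json.Marshal(statusesBody{Jobs: jobs})
	if err != nil {
		return "", err
	}
	return string(b), nil
}

// statusesURL builds the request URL. A non-zero index adds ?index=N,
// which the commit message describes as starting a blocking query.
func statusesURL(base string, index uint64) string {
	u := base + "/v1/jobs/statuses"
	if index > 0 {
		q := url.Values{}
		q.Set("index", strconv.FormatUint(index, 10))
		u += "?" + q.Encode()
	}
	return u
}

func main() {
	body, err := buildStatusesBody([]jobRef{{Namespace: "cool-ns", ID: "cool-job"}})
	if err != nil {
		panic(err)
	}
	// The base address below is an assumption for illustration only.
	fmt.Println(statusesURL("http://127.0.0.1:4646", 42))
	fmt.Println(body)
}
```

Pinning the job list in the body is what lets a UI keep the same rows on screen while it waits on the blocking query, which is the "jostling" concern the commit message raises.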
Tim Gross
54fc146432 agent: add support for sdnotify protocol (#20528)
Nomad agents expect to receive `SIGHUP` to reload their configuration. The
signal handler for this is installed fairly late in agent startup, after the
client or server components are up and running. This means that configuration
management tools can potentially reload the configuration before the agent can
handle it, causing the agent to crash.

We don't want to allow configuration reload during client or server component
startup, because it would significantly complicate initialization. Instead,
we'll implement the systemd notify protocol. This causes systemd to block
sending configuration reload signals until the agent is actually ready. Users
can still bypass this by sending signals directly.

Note that there are several Go libraries that implement the sdnotify protocol,
but most are part of much larger projects which would create a lot of dependabot
burden. The bits of the protocol we need are extremely simple to implement in
just a couple of functions.

For non-Linux or non-systemd Linux systems, this feature is a no-op. In future
work we could potentially implement service notification for Windows as well.

Fixes: https://github.com/hashicorp/nomad/issues/3885
2024-05-03 13:42:07 -04:00
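To illustrate why commit 54fc146432 above calls the needed part of the protocol simple, here is a minimal Go sketch of a readiness notification. It is not the code from the commit: the `notify` function and `errNoSocket` value are hypothetical names. It relies only on the documented systemd behaviour that the service manager passes a unix datagram socket path in `NOTIFY_SOCKET` and accepts a `READY=1` datagram on it.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
)

// errNoSocket reports that no notify socket is configured, which is the
// expected case on non-systemd systems (the feature becomes a no-op).
var errNoSocket = errors.New("NOTIFY_SOCKET not set")

// notify sends a single state datagram (for example "READY=1") to the
// unix datagram socket at socketPath. The function name is hypothetical.
func notify(socketPath, state string) error {
	if socketPath == "" {
		return errNoSocket
	}
	addr := &net.UnixAddr{Name: socketPath, Net: "unixgram"}
	conn, err := net.DialUnix("unixgram", nil, addr)
	if err != nil {
		return err
	}
	defer conn.Close()
	_, err = conn.Write([]byte(state))
	return err
}

func main() {
	err := notify(os.Getenv("NOTIFY_SOCKET"), "READY=1")
	switch {
	case errors.Is(err, errNoSocket):
		fmt.Println("no notify socket configured; skipping")
	case err != nil:
		fmt.Println("notify failed:", err)
	default:
		fmt.Println("readiness sent")
	}
}
```

A service unit would need `Type=notify` for systemd to wait on this message; that unit setting is standard systemd configuration and is not quoted from the commit.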
Tim Gross
f41bc468eb consul: provide CONSUL_HTTP_TOKEN env var to tasks (#20519)
When available, we provide an environment variable `CONSUL_TOKEN` to tasks, but
this isn't the environment variable expected by the Consul CLI. Job
specifications like deploying an API Gateway become noticeably nicer if we can
instead provide the expected env var.
2024-05-03 11:30:33 -04:00
James Rasell
cd9e032855 deps: upgrade hashicorp/cap to v0.6.0 (#20517) 2024-05-03 15:30:48 +01:00
Tim Gross
f9dd120d29 cli: add -jwks-ca-file to Vault/Consul setup commands (#20518)
When setting up auth methods for Consul and Vault in production environments, we
can typically assume that the CA certificate for the JWKS endpoint will be in
the host certificate store (as part of the usual configuration management
cluster admins need to do). But for quick demos with `-dev` agents, this won't
be the case.

Add a `-jwks-ca-file` parameter to the setup commands so that we can use this
tool to quickly set up WI with `-dev` agents running TLS.
2024-05-03 08:26:29 -04:00
Seth Hoenig
422d62df89 checklist: remove steps for openapi for rpc (#20515) 2024-05-02 08:53:45 -05:00
James Rasell
3f866a7e82 test: regenerate test TLS certificates. (#20511) 2024-05-02 13:58:32 +01:00
Michael Schurter
3aefc010d7 test: remove spurious print statements (#20503) 2024-05-01 09:47:56 -07:00
Tim Gross
77dc74a301 quota: ensure quota usage is freed when jobs are purged (#20492)
When a job is purged, we delete all its allocations and the client detects the
absence of the allocations to clean up its resources locally. But the client
won't be able to send an allocation status update in this case, which frees the
quota being used by that allocation. Instead, we need to free the quota usage
inside the state store immediately. To do so, we check if the allocation is
already client-terminal before copying it and passing it into the Enterprise
code for cleanup.

This commit also refactors the job delete to make it clear there's a single
caller of this alloc deletion path. This refactoring eliminates some wasteful
logic that queries the "allocs" table, allocates a slice of strings for their
IDs, and then queries the "allocs" table one-by-one for each of them for
deletion anyways.

Tests for this code can be found in the linked ENT repo PR.

Fixes: https://github.com/hashicorp/nomad-enterprise/issues/1422
Ref: https://hashicorp.atlassian.net/browse/NOMAD-620
Ref: https://github.com/hashicorp/nomad-enterprise/pull/1432
2024-05-01 08:44:22 -04:00
Piotr Kazmierczak
abe9c0803a e2e: unflake TestWorkloadIdentity/testNobody (#20499)
sometimes the container quits too fast
2024-04-30 18:17:14 +02:00
James Rasell
05a7bb53d3 cli: fix handling of scaling jobs which don't generate evals. (#20479)
In some cases, such as parameterized jobs, Nomad job scaling will
not generate evaluations. This change fixes the CLI behaviour
in this case, and copies the job run command for consistency.
2024-04-30 10:32:31 +01:00
Tim Gross
ff2d9de592 Revert "E2E: skip Vault 1.16.1 for JWT compatibility test (#20301)" (#20484)
This reverts commit 45b36371a12ffae5b5bfaaeadb08f801fb6bc98d. Now that Vault
1.16.2 has shipped, the E2E test will pick up only a working version.

Closes: https://github.com/hashicorp/nomad/issues/20298
2024-04-26 09:36:09 -04:00
Seth Hoenig
5f64e42d73 client: fix up how the alloc mounts directory is set up (#20463) 2024-04-26 07:29:52 -05:00
Seth Hoenig
7874d21881 docs: add exec2 task driver page (#20480) 2024-04-24 07:26:54 -05:00
Seth Hoenig
8ae1a0e356 docs: add docs around dynamic workload users (#20477) 2024-04-23 07:57:40 -05:00
Seth Hoenig
1dfc715721 docs: add docs for fsisolation.Unveil fs isolation mode (#20475) 2024-04-23 07:55:54 -05:00
Daniel Bennett
3ac3bc1cfe acl: token global mode can not be changed (#20464)
true up CLI and docs with API reality
2024-04-22 11:58:47 -05:00
Tim Gross
ea5f2f6748 acl: remove remaining unused nil ACL object handling (#20456)
As of #18754 which shipped in Nomad 1.7, we no longer need to nil-check the
object returned by ResolveACL if there's no error return, because in the case
where ACLs are disabled we return a special "ACLs disabled" ACL object. Checking
nil is not a bug but should be discouraged because it opens us up to future bugs
that would bypass ACLs.

We fixed a bunch of these cases in https://github.com/hashicorp/nomad/pull/20150
but I didn't update the semgrep rule, which meant we missed a few more. Update
the semgrep rule and fix the remaining cases.
2024-04-18 14:34:17 -04:00
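The pattern behind commit ea5f2f6748 above can be sketched as follows. This is an illustrative model only: the `ACL` type, `aclsDisabled` value, `Allow` method, and `resolveACL` function are hypothetical stand-ins, and the sketch assumes the special "ACLs disabled" object permits every operation.

```go
package main

import (
	"errors"
	"fmt"
)

// ACL is a hypothetical stand-in for a resolved ACL object.
type ACL struct {
	disabled bool
	allowed  map[string]bool
}

// aclsDisabled is the special object returned when ACLs are turned off.
// Assumption for this sketch: every check against it passes.
var aclsDisabled = &ACL{disabled: true}

// Allow reports whether the operation is permitted.
func (a *ACL) Allow(op string) bool {
	if a.disabled {
		return true
	}
	return a.allowed[op]
}

// resolveACL never returns (nil, nil): callers get either an error or a
// usable object, so they have no reason to nil-check the result.
func resolveACL(enabled bool, token string) (*ACL, error) {
	if !enabled {
		return aclsDisabled, nil
	}
	if token == "" {
		return nil, errors.New("missing token")
	}
	return &ACL{allowed: map[string]bool{"read-job": true}}, nil
}

func main() {
	acl, err := resolveACL(true, "example-token")
	if err != nil {
		panic(err)
	}
	// No "acl != nil &&" guard here. A guard of that form would skip
	// enforcement entirely if a nil object ever slipped through.
	fmt.Println("read-job allowed:", acl.Allow("read-job"))
	fmt.Println("submit-job allowed:", acl.Allow("submit-job"))
}
```

The design point is that a nil check turns "no ACL object" into "no enforcement", whereas an explicit disabled object keeps every call site on the same checked path.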
Piotr Kazmierczak
048f4511e2 docs: correct nanoseconds to milliseconds for MeasureSince metrics (#20446) 2024-04-18 18:16:58 +02:00
dependabot[bot]
b25de662a1 chore(deps): bump github.com/docker/docker from 25.0.2+incompatible to 26.0.1+incompatible (#20389)
* chore(deps): bump github.com/docker/docker

Bumps [github.com/docker/docker](https://github.com/docker/docker) from 25.0.2+incompatible to 26.0.1+incompatible.
- [Release notes](https://github.com/docker/docker/releases)
- [Commits](https://github.com/docker/docker/compare/v25.0.2...v26.0.1)

---
updated-dependencies:
- dependency-name: github.com/docker/docker
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

* include changelog

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Tim Gross <tgross@hashicorp.com>
2024-04-18 11:35:09 -04:00
Tim Gross
e4fe564bba deps: update golang.org/x/net (#20434)
Although Nomad does not use HTTP2, vulnerability scans detect our version of
`golang.org/x/net` as having an HPACK DoS vuln (GHSA-4v7x-pqxf-cx7m). Upgrade
the library so as to quiet the alerts.

Fixes: https://github.com/hashicorp/nomad-enterprise/issues/1423
2024-04-18 10:34:35 -04:00
Tim Gross
b662f1e6e5 docs: fix incorrect dispatch payload limit in API docs (#20433)
The dispatch payload is limited to 16KiB, not 64KiB. It's correct in the
command docs but incorrect in the API docs.

Ref: https://github.com/hashicorp/nomad/blob/v1.7.7/nomad/job_endpoint.go#L36-L38
Fixes: https://github.com/hashicorp/nomad/issues/20432
2024-04-18 10:20:15 -04:00
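For readers checking payloads before dispatching, the limit in commit b662f1e6e5 above works out to 16 × 1024 = 16384 bytes. The Go sketch below is a hypothetical client-side check, not the server code the commit links to; the names `maxDispatchPayload` and `checkPayload` are invented here, and treating a payload of exactly 16KiB as acceptable is an assumption.

```go
package main

import "fmt"

// maxDispatchPayload is the 16KiB limit stated in the commit message.
// The constant name is hypothetical.
const maxDispatchPayload = 16 * 1024

// checkPayload rejects payloads larger than the limit. Accepting a
// payload of exactly 16KiB is an assumption of this sketch.
func checkPayload(p []byte) error {
	if len(p) > maxDispatchPayload {
		return fmt.Errorf("payload is %d bytes, limit is %d bytes", len(p), maxDispatchPayload)
	}
	return nil
}

func main() {
	fmt.Println(checkPayload(make([]byte, 1024)))    // within the limit
	fmt.Println(checkPayload(make([]byte, 64*1024))) // the size the API docs wrongly allowed
}
```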
Daniel Bennett
363d2370f3 test: change some helpers testing.T to .TB (#20427)
TB interface instead of T struct,
so they can be used in Benchmarks too
2024-04-17 14:03:12 -05:00
Tim Gross
6d58acd897 WI: ensure tasks within same alloc get different Consul tokens (#20411)
The `consul_hook` in the allocrunner gets a separate Consul token for each task,
even if the tasks' identities have the same name, but used the identity name as
the key to the alloc hook resources map. This means the last task in the group
overwrites the Consul tokens of all other tasks.

Fix this by adding the task name to the key in the allocrunner's
`consul_hook`. And update the taskrunner's `consul_hook` to expect the task
name in the key.

Fixes: https://github.com/hashicorp/nomad/issues/20374
Fixes: https://hashicorp.atlassian.net/browse/NOMAD-614
2024-04-17 11:29:58 -04:00
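The overwrite described in commit 6d58acd897 above is a plain map-key collision, which the following Go sketch reproduces. The `task` type, both functions, the token strings, and the `identity/task` key format are hypothetical; the commit message says only that the task name was added to the key.

```go
package main

import "fmt"

// task is a hypothetical, simplified view of a task and its identity name.
type task struct {
	Name     string
	Identity string
}

// tokensByIdentity reproduces the buggy keying: tasks that share an
// identity name write to the same map entry, so the last one wins.
func tokensByIdentity(tasks []task) map[string]string {
	out := map[string]string{}
	for _, t := range tasks {
		out[t.Identity] = "token-for-" + t.Name
	}
	return out
}

// tokensByTaskAndIdentity includes the task name in the key, so each
// task keeps its own token. The key format is an assumption.
func tokensByTaskAndIdentity(tasks []task) map[string]string {
	out := map[string]string{}
	for _, t := range tasks {
		out[t.Identity+"/"+t.Name] = "token-for-" + t.Name
	}
	return out
}

func main() {
	tasks := []task{
		{Name: "web", Identity: "consul_default"},
		{Name: "sidecar", Identity: "consul_default"},
	}
	fmt.Println(len(tokensByIdentity(tasks)), "entry with identity-only keys")
	fmt.Println(len(tokensByTaskAndIdentity(tasks)), "entries with task-qualified keys")
}
```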
Juana De La Cuesta
64978662b6 Post 1.7.7 release (#20421)
Generate files for 1.7.7 release, prepare for next release and merge release 1.7.7 files
2024-04-17 10:44:32 +02:00
Daniel Bennett
ca1860ae76 state: enable more reverse sorting (#20410)
* mainly jobs endpoint
* update call sites
* add new sort helpers
* put sorting in a separate file
2024-04-16 15:10:11 -05:00
Tu Nguyen
79c07807f4 docs: update docs link in quick start (#20409) 2024-04-16 15:52:35 -04:00
Phil Renaud
5150adffc0 [ui] Fix a bug where promotion would be asked with no new canaries (#20408)
* Fix a UI bug where promotion would be asked with no new canaries

* Because we now make sure of your allocations, our test cases should more accurately reflect a state of a promotable workflow
2024-04-16 15:50:06 -04:00
Tim Gross
22bfcdecf1 docs: add missing copyright headers in Terraform examples (#20412) 2024-04-16 15:21:03 -04:00
Nick Wales
e014e8411c terraform: updates AWS example packer and terraform code (#19512)
The "Provision a Nomad cluster in the cloud" works in AWS with these updates:

- uses an available Ubuntu version
- uses hashicorp packages where possible
- updates Nvidia installation
- installs CNI plugins
2024-04-16 10:47:31 -04:00
Luiz Aoqui
9d4f7bcb68 mock_driver: fix fingerprint key (#20351)
The `mock_driver` is an internal task driver used mostly for testing and
simulating workloads. During the allocrunner v2 work (#4792) its name
changed from `mock_driver` to just `mock` and then back to
`mock_driver`, but the fingerprint key was kept as `driver.mock`.

This results in tasks configured with `driver = "mock"` being scheduled
(because Nomad thinks the client has a task driver called `mock`), but
failing to actually run (because the Nomad client can't find a driver
called `mock` in its catalog).

Fingerprinting the right name prevents the job from being scheduled in
the first place.

Also removes mentions of the mock driver from documentation since it's an
internal driver and not available in any production release.
2024-04-16 07:16:55 +01:00
Daniel Bennett
ee213c3ddd comment on Job.ModifyIndex vs Job.JobModifyIndex (#20393) 2024-04-15 16:39:16 -05:00
Daniel Bennett
30c0461048 systemd: comment on OOMScoreAdjust in service unit (#20392) 2024-04-15 16:35:41 -05:00
Tim Gross
745d1dbe10 deps: update go-getter (#20391) 2024-04-15 16:59:53 -04:00
Piotr Kazmierczak
0d14dd96ca eval_broker: track enqueue and dequeue times (#20329)
Adds new metrics to the eval broker that track times of evaluations enqueueing
and dequeueing.
2024-04-15 16:16:50 +02:00
Tim Gross
1739f94e84 docs: fix a broken link on the Consul index page (#20387) 2024-04-12 15:31:48 -04:00
Phil Renaud
f9c4d2bdf0 the hasBeenRestarted allocation property checks against its task events, which can sometimes be null (#20383) 2024-04-12 14:49:07 -04:00
Tim Gross
43281f6038 docs: provide guidance on using Consul DNS (#20369)
Add a standalone section to the Consul integration docs showing how to configure
both the Consul agent and the workload to take advantage of Consul DNS. Include
a reference to the new transparent proxy feature as well.

Fixes: https://github.com/hashicorp/nomad/issues/18305
2024-04-12 14:38:04 -04:00
Tim Gross
9cb1ef3e3d CNI: fix bugs in parsing strings to port number integers (#20379)
Ports are a maximum of uint16, but we have a few places in the recent tproxy
code where we were parsing them as 64-bit wide integers and then downcasting
them to `int`, which is technically unsafe and triggers code scanning alerts. In
practice we've validated the range elsewhere and don't build for 32-bit
platforms. This changeset fixes the parsing to make everything a bit more robust
and silence the alert.

Fixes: https://github.com/hashicorp/nomad-enterprise/security/code-scanning/444
2024-04-12 13:31:25 -04:00
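The safer parsing described in commit 9cb1ef3e3d above can be sketched with the standard library alone. This is an illustration rather than the commit's code, and `parsePort` is a hypothetical name. Passing a bit size of 16 to `strconv.ParseUint` makes the parser itself reject out-of-range values, so the final conversion to `uint16` cannot truncate.

```go
package main

import (
	"fmt"
	"strconv"
)

// parsePort converts a decimal string to a port number. With bitSize 16,
// ParseUint returns a range error for anything above 65535 and a syntax
// error for negative or non-numeric input.
func parsePort(s string) (uint16, error) {
	n, err := strconv.ParseUint(s, 10, 16)
	if err != nil {
		return 0, fmt.Errorf("invalid port %q: %w", s, err)
	}
	return uint16(n), nil
}

func main() {
	for _, s := range []string{"8080", "65536", "-1"} {
		port, err := parsePort(s)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println("port:", port)
	}
}
```

Compared with parsing as a 64-bit integer and downcasting to `int`, this keeps the range check and the conversion in one place.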
Daniel Bennett
bd802e43d0 add LICENSE to release artifacts (#20345)
* add LICENSE(.txt) to zip that goes on releases.hashicorp.com
* add LICENSE(.txt) to linux packages and docker image
* add some more docker labels (including license)
2024-04-12 10:57:15 -05:00
Tim Gross
d40e23f939 E2E: clean up go mod cache after building consul-cni (#20378)
In #20296 we added a Go tool chain to the AMI we use for E2E tests, so that we
can build `consul-cni` for tproxy testing. This is intended to be temporary
until `consul-k8s` 1.4.2 is officially released. But the Go cache from building
`consul-k8s` uses up roughly 1.5GiB of space and the test machines have fairly
small disks. This causes the Nomad clients to aggressively GC client allocations
that stop, which breaks tests that run batch workloads and then read their logs.
2024-04-12 11:52:46 -04:00
Seth Hoenig
ae6c4c8e3f deps: purge use of old x/exp packages (#20373) 2024-04-12 08:29:00 -05:00
Tim Gross
1e50090776 docs: clarify "best effort" for ephemeral disk migration (#20357)
The docs for ephemeral disk migration use the term "best effort" without
outlining the requirements or the cases under which the migration can
fail. Update the docs to make it obvious that ephemeral disk migration is
subject to data loss.

Fixes: https://github.com/hashicorp/nomad/issues/20355
2024-04-11 16:35:22 -04:00
astudentofblake
7b7ed12326 func: Allow custom paths to be added to the getter landlock (#20349)
* func: Allow custom paths to be added to the getter landlock

Fixes: 20315

* fix: slices imports
fix: more meaningful examples
fix: improve documentation
fix: quote error output
2024-04-11 15:17:33 -05:00