The batch deregister RPC endpoint is only used by the internal
garbage collection process; it is not exposed via the HTTP API or
used anywhere else.
The GC process ensures that a job can only be removed from state
if all of its related evaluations and allocations can also be
removed. This means that we do not need to create evaluations when
jobs are being deregistered via this endpoint.
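For illustration, a minimal sketch of that invariant (the types and names below are hypothetical stand-ins, not Nomad's actual implementation):

```go
package gc

// Hypothetical stand-ins for Nomad's evaluation and allocation structs.
type evaluation struct{ terminal bool }
type allocation struct{ terminal bool }

// jobGCEligible sketches the invariant: a job may be batch-deregistered only
// if every related evaluation and allocation is itself removable from state.
func jobGCEligible(evals []evaluation, allocs []allocation) bool {
	for _, e := range evals {
		if !e.terminal {
			return false
		}
	}
	for _, a := range allocs {
		if !a.terminal {
			return false
		}
	}
	return true
}
```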
* Hook and latch on the initial index
* Serialization and restart of controller and table
* de-log
* allocBlocks reimplemented at job model level
* totalAllocs doesn't mean on jobmodel what it did in steady.js
* Hamburgers to sausages
* Hacky way to bring new jobs back around and parent job handling in list view
* Getting closer to hook/latch
* Latch from update on hook from initialize, but fickle
* Note on multiple-watch problem
* Sensible monday morning comment removal
* use of abortController to handle transition and reset events
* Next token will now update when there's an on-page shift
* Very rough anti-jostle technique
* Demoable, now to move things out of route and into controller
* Into the controller, generally
* Smarter cancellations
* Reset abortController on index models run, and system/sysbatch jobs now have an improved groupCountSum computed property
* Prev Page reverse querying
* n+1th jobs existing will trigger nextToken/pagination display
* Start of a GET/POST statuses return
* Namespace fix
* Unblock tests
* Realizing to my small horror that this skipURLModification flag may be too heavy-handed
* Lintfix
* Default liveupdates localStorage setting to true
* Pagination and index rethink
* Big uncoupling of watchable and url-append stuff
* Testfixes for region, search, and keyboard
* Job row class for test purposes
* Allocations in test now contain events
* Starting on the jobs list tests in earnest
* Forbidden state de-bubbling cleanup
* Job list page size fixes
* Facet/Search/Filter jobs list tests skipped
* Maybe it's the automatic mirage logging
* Unbreak task unit test
* Pre-sort sort
* styling for jobs list pagination and general PR cleanup
* moving from Job.ActiveDeploymentID to Job.LatestDeployment.ID
* modifyIndex-based pagination (#20350)
* modifyIndex-based pagination
* modifyIndex gets its own column and pagination compacted with icons
* A generic withPagination handler for mirage
* Some live-PR changes
* Pagination and button disabled tests
* Job update handling tests for jobs index
* assertion timeout in case of long setTimeouts
* assert.timeouts down to 500ms
* de-to-do
* Clarifying comment and test descriptions
* Bugfix: resizing your browser on the new jobs index page would make the viz grow forever (#20458)
* [ui] Searching and filtering options (#20459)
* Beginnings of a search box for filter expressions
* jobSearchBox integration test
* jobs list updateFilter initial test
* Basic jobs list filtering tests
* First attempt at side-by-side facets and search with a computed filter
* Weirdly close to an iterative approach but checked isn't tracked properly
* Big rework to make filter composition and decomposition work nicely with the url
* Namespace facet dropdown added
* NodePool facet dropdown added
* hdsFacet for future testing and basic namespace filtering test
* Namespace filter existence test
* Status filtering
* Node pool/dynamic facet test
* Test patchups
* Attempt at optimize test fix
* Allocation re-load on optimize page explainer
* The Big Un-Skip
* Post-PR-review cleanup
* todo-squashing
* [ui] Handle parent/child jobs with the paginated Jobs Index route (#20493)
* First pass at a non-watchQuery version
* Parameterized jobs get child fetching and jobs index status style for parent jobs
* Completed allocs vs Running allocs in a child-job context, and fix an issue where moving from parent to parent would not reset index
* Testfix and better handling empty-child-statuses-list
* Parent/child test case
* Don't show empty allocation-status bars for parent jobs with no children
* Splits Settings into 2 sections, sign-in/profile and user settings (#20535)
* Changelog
introduce a new API /v1/jobs/statuses, primarily for use in the UI,
which collates info about jobs, their allocations, and latest deployment.
currently the UI gets *all* of /v1/jobs and sorts and paginates them client-side
in the browser, and its "summary" column is based on historical summary data
(which can be visually misleading, and sometimes scary when a job has failed
at some point in the not-yet-garbage-collected past).
this endpoint handles pagination and filtering server-side, and returns jobs
sorted by ModifyIndex, so latest-changed jobs still come first. it pulls allocs
and the latest deployment straight out of current state for a more robust,
holistic view of the job status.
it is less efficient per-job, due to the extra state lookups, but should be more efficient
per-page (excepting perhaps for jobs with very many allocs).
if a POST body is sent like `{"jobs": [{"namespace": "cool-ns", "id": "cool-job"}]}`,
then the response will be limited to that subset of jobs. the main goal here is to
prevent "jostling" the user in the UI when jobs come into and out of existence.
and if a blocking query is started with `?index=N`, then the query should only
unblock if jobs "on page" change, rather than on any change to the state
tables being queried ("jobs", "allocs", and "deployment"), to save unnecessary
HTTP round trips.
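As a rough illustration, here is a minimal Go client for the endpoint as described above (the address and error handling are assumptions, not part of this change):

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Limit the response to a fixed set of jobs, per the POST body described
	// above, to avoid jostling the UI as jobs come and go.
	body := bytes.NewBufferString(`{"jobs": [{"namespace": "cool-ns", "id": "cool-job"}]}`)

	// ?index=N makes this a blocking query that should only unblock when the
	// jobs "on page" change, not on every write to the underlying tables.
	url := "http://127.0.0.1:4646/v1/jobs/statuses?index=42"

	resp, err := http.Post(url, "application/json", body)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}
```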
Nomad agents expect to receive `SIGHUP` to reload their configuration. The
signal handler for this is installed fairly late in agent startup, after the
client or server components are up and running. This means that configuration
management tools can potentially send the reload signal before the agent is
able to handle it, causing the agent to crash.
We don't want to allow configuration reload during client or server component
startup, because it would significantly complicate initialization. Instead,
we'll implement the systemd notify protocol. This causes systemd to block
sending configuration reload signals until the agent is actually ready. Users
can still bypass this by sending signals directly.
Note that there are several Go libraries that implement the sdnotify protocol,
but most are part of much larger projects which would create a lot of dependabot
burden. The bits of the protocol we need are extremely simple to implement in
just a couple of functions.
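As a sketch of how small that is (simplified: a real implementation also handles systemd's abstract sockets, whose names begin with `@`):

```go
package main

import (
	"net"
	"os"
)

// sdNotify sends a state string to the unix datagram socket systemd names in
// NOTIFY_SOCKET. It is a no-op when the variable is unset, which covers
// non-systemd Linux and non-Linux systems.
func sdNotify(state string) error {
	path := os.Getenv("NOTIFY_SOCKET")
	if path == "" {
		return nil // not running under a Type=notify systemd unit
	}
	conn, err := net.Dial("unixgram", path)
	if err != nil {
		return err
	}
	defer conn.Close()
	_, err = conn.Write([]byte(state))
	return err
}

func main() {
	// Called once the client/server components are running and the SIGHUP
	// handler is installed, so systemd holds reloads until we're ready.
	_ = sdNotify("READY=1")
}
```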
For non-Linux or non-systemd Linux systems, this feature is a no-op. In future
work we could potentially implement service notification for Windows as well.
Fixes: https://github.com/hashicorp/nomad/issues/3885
When available, we provide an environment variable `CONSUL_TOKEN` to tasks, but
this isn't the environment variable expected by the Consul CLI. Job
specifications for workloads like an API Gateway become noticeably nicer if we
can instead provide the expected env var.
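A minimal sketch of the idea (the helper is hypothetical; `CONSUL_HTTP_TOKEN` is the variable the Consul CLI actually reads):

```go
package taskenv

// consulEnv publishes the task's Consul token under both names.
func consulEnv(token string) map[string]string {
	return map[string]string{
		"CONSUL_TOKEN":      token, // variable Nomad already provides
		"CONSUL_HTTP_TOKEN": token, // variable the Consul CLI expects
	}
}
```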
When setting up auth methods for Consul and Vault in production environments, we
can typically assume that the CA certificate for the JWKS endpoint will be in
the host certificate store (as part of the usual configuration management
cluster admins need to do). But for quick demos with `-dev` agents, this won't
be the case.
Add a `-jwks-ca-file` parameter to the setup commands so that we can use this
tool to quickly set up WI with `-dev` agents running TLS.
When a job is purged, we delete all its allocations, and the client detects the
absence of the allocations to clean up its resources locally. But the client
won't be able to send an allocation status update in this case, and that update
is what normally frees the quota being used by the allocation. Instead, we need
to free the quota usage
inside the state store immediately. To do so, we check if the allocation is
already client-terminal before copying it and passing it into the Enterprise
code for cleanup.
This commit also refactors the job delete to make it clear there's a single
caller of this alloc deletion path. This refactoring eliminates some wasteful
logic that queries the "allocs" table, allocates a slice of strings for their
IDs, and then queries the "allocs" table again one-by-one to delete each of
them.
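A hedged sketch of the core idea (the type and status names only approximate Nomad's `structs` package):

```go
package state

// Allocation stands in for Nomad's allocation struct.
type Allocation struct {
	ClientStatus string
}

// clientTerminal approximates the client-terminal check described above.
func (a *Allocation) clientTerminal() bool {
	switch a.ClientStatus {
	case "complete", "failed", "lost":
		return true
	}
	return false
}

// prepareForQuotaCleanup forces a purged job's allocation into a
// client-terminal state before the Enterprise quota cleanup sees it, since no
// client status update will ever arrive for a deleted allocation. The copy
// matters: objects in the state store must never be mutated in place.
func prepareForQuotaCleanup(a *Allocation) *Allocation {
	if a.clientTerminal() {
		return a
	}
	c := *a
	c.ClientStatus = "complete"
	return &c
}
```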
Tests for this code can be found in the linked ENT repo PR.
Fixes: https://github.com/hashicorp/nomad-enterprise/issues/1422
Ref: https://hashicorp.atlassian.net/browse/NOMAD-620
Ref: https://github.com/hashicorp/nomad-enterprise/pull/1432
In some cases, such as parameterized jobs, Nomad job scaling will not
generate an evaluation. This change fixes the CLI behaviour in this case
and mirrors the job run command for consistency.
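A sketch of the resulting CLI behaviour (names hypothetical):

```go
package cli

// scaleExitCode captures the fix: when scaling produces no evaluation, as
// with parameterized jobs, there is nothing to monitor and the command
// should simply succeed, as job run already does.
func scaleExitCode(evalID string, monitor func(evalID string) int) int {
	if evalID == "" {
		return 0 // scaling succeeded; no evaluation to monitor
	}
	return monitor(evalID)
}
```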
This reverts commit 45b36371a12ffae5b5bfaaeadb08f801fb6bc98d. Now that Vault
1.16.2 has shipped, the E2E test will pick up only a working version.
Closes: https://github.com/hashicorp/nomad/issues/20298
As of #18754 which shipped in Nomad 1.7, we no longer need to nil-check the
object returned by ResolveACL if there's no error return, because in the case
where ACLs are disabled we return a special "ACLs disabled" ACL object. Checking
nil is not a bug but should be discouraged because it opens us up to future bugs
that would bypass ACLs.
We fixed a bunch of these cases in https://github.com/hashicorp/nomad/pull/20150
but I didn't update the semgrep rule, which meant we missed a few more. Update
the semgrep rule and fix the remaining cases.
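For reference, a self-contained sketch of the pattern the rule enforces (the types here are stand-ins for Nomad's `acl` package):

```go
package rpc

import "errors"

// aclObject stands in for acl.ACL; with ACLs disabled, ResolveACL returns a
// permissive "ACLs disabled" object, never nil.
type aclObject struct{ allowAll bool }

func (a *aclObject) AllowNamespaceOperation(ns, op string) bool {
	return a.allowAll // the real object consults attached policies
}

var errPermissionDenied = errors.New("Permission denied")

// checkReadJob shows the enforced pattern: handle the error, then use the
// ACL object directly, with no nil check in between.
func checkReadJob(resolve func() (*aclObject, error), ns string) error {
	aclObj, err := resolve()
	if err != nil {
		return err
	}
	// No `aclObj == nil` check: that is exactly what the semgrep rule flags,
	// since it invites future bugs that silently bypass ACLs.
	if !aclObj.AllowNamespaceOperation(ns, "read-job") {
		return errPermissionDenied
	}
	return nil
}
```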
Although Nomad does not use HTTP2, vulnerability scans detect our version of
`golang.org/x/net` as having an HPACK DoS vuln (GHSA-4v7x-pqxf-cx7m). Upgrade
the library so as to quiet the alerts.
Fixes: https://github.com/hashicorp/nomad-enterprise/issues/1423
The `consul_hook` in the allocrunner gets a separate Consul token for each
task, even if the tasks' identities have the same name, but it used the
identity name as the key in the alloc hook resources map, so the last task in
the group overwrote the Consul tokens of all the other tasks.
Fix this by adding the task name to the key in the allocrunner's
`consul_hook`, and update the taskrunner's `consul_hook` to expect the task
name in the key.
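A minimal sketch of the key change (function name hypothetical):

```go
package consulhook

// tokenKey includes the task name in the hook-resources key so tasks that
// share an identity name keep distinct Consul tokens. Previously the key was
// the identity name alone.
func tokenKey(taskName, identityName string) string {
	return taskName + "/" + identityName
}
```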
Fixes: https://github.com/hashicorp/nomad/issues/20374
Fixes: https://hashicorp.atlassian.net/browse/NOMAD-614
* Fix a UI bug where promotion would be offered with no new canaries
* Because we now make use of your allocations, our test cases should more accurately reflect the state of a promotable workflow
The "Provision a Nomad cluster in the cloud" works in AWS with these updates:
- use an available ubuntu version
- uses hashicorp packages where possible
- updates Nvidia installation
- installs CNI plugins
The `mock_driver` is an internal task driver used mostly for testing and
simulating workloads. During the allocrunner v2 work (#4792) its name
changed from `mock_driver` to just `mock` and then back to
`mock_driver`, but the fingerprint key was kept as `driver.mock`.
This results in tasks configured with `driver = "mock"` being scheduled
(because Nomad thinks the client has a task driver called `mock`), but
fail to actually run (because the Nomad client can't find a driver
called `mock` in its catalog).
Fingerprinting the right name prevents the job from being scheduled in
the first place.
Also removes mentions of the mock driver from documentation since it's an
internal driver and not available in any production release.
Add a standalone section to the Consul integration docs showing how to configure
both the Consul agent and the workload to take advantage of Consul DNS. Include
a reference to the new transparent proxy feature as well.
Fixes: https://github.com/hashicorp/nomad/issues/18305
Ports fit in a uint16 at most, but we have a few places in the recent tproxy
code where we were parsing them as 64-bit wide integers and then downcasting
them to `int`, which is technically unsafe and triggers code scanning alerts. In
practice we've validated the range elsewhere and don't build for 32-bit
platforms. This changeset fixes the parsing to make everything a bit more robust
and silence the alert.
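A minimal example of the safer parse (standard library only; not the exact code in this changeset):

```go
package main

import (
	"fmt"
	"strconv"
)

// parsePort bounds the width at parse time instead of downcasting a 64-bit
// value after the fact.
func parsePort(s string) (uint16, error) {
	p, err := strconv.ParseUint(s, 10, 16) // rejects anything above 65535
	if err != nil {
		return 0, fmt.Errorf("invalid port %q: %w", s, err)
	}
	return uint16(p), nil
}

func main() {
	fmt.Println(parsePort("8080"))  // 8080 <nil>
	fmt.Println(parsePort("70000")) // 0 and an out-of-range error
}
```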
Fixes: https://github.com/hashicorp/nomad-enterprise/security/code-scanning/444
* add LICENSE(.txt) to zip that goes on releases.hashicorp.com
* add LICENSE(.txt) to linux packages and docker image
* add some more docker labels (including license)
In #20296 we added a Go toolchain to the AMI we use for E2E tests, so that we
can build `consul-cni` for tproxy testing. This is intended to be temporary
until `consul-k8s` 1.4.2 is officially released. But the Go cache from building
`consul-k8s` uses up roughly 1.5GiB of space and the test machines have fairly
small disks. This causes the Nomad clients to aggressively GC client allocations
that stop, which breaks tests that run batch workloads and then read their logs.
The docs for ephemeral disk migration use the term "best effort" without
outlining the requirements or the cases under which the migration can
fail. Update the docs to make it obvious that ephemeral disk migration is
subject to data loss.
Fixes: https://github.com/hashicorp/nomad/issues/20355