Commit Graph

3453 Commits

Author SHA1 Message Date
stswidwinski
2285432424 GC: ensure no leakage of evaluations for batch jobs. (#15097)
Prior to 2409f72 the code compared the modification index of a job to itself. Afterwards, the code compared the creation index of the job to itself. In either case there should never be a case of re-parenting of allocs causing the evaluation to trivially always result in false, which leads to unreclaimable memory.

Prior to this change allocations and evaluations for batch jobs were never garbage collected until the batch job was explicitly stopped. The new `batch_eval_gc_threshold` server configuration controls how often they are collected. The default threshold is `24h`.
2023-01-31 13:32:14 -05:00
Jorge Marey
340ad2db58 Rename fields on proxyConfig (#15541)
* Change api Fields for expose and paths

* Add changelog entry

* changelog: add deprecation notes about connect fields

* api: minor style tweaks

---------

Co-authored-by: Seth Hoenig <shoenig@duck.com>
2023-01-30 09:31:16 -06:00
Piotr Kazmierczak
949a6f60c7 renamed stanza to block for consistency with other projects (#15941) 2023-01-30 15:48:43 +01:00
James Rasell
166aee71f3 cli: separate auth method config output for easier reading. (#15892) 2023-01-30 11:44:26 +01:00
Seth Hoenig
ce71d2dc0d consul: check for acceptable service identity on consul tokens (#15928)
When registering a job with a service and 'consul.allow_unauthenticated=false',
we scan the given Consul token for an acceptable policy or role with an
acceptable policy, but did not scan for an acceptable service identity (which
is backed by an acceptable virtual policy). This PR updates our consul token
validation to also accept a matching service identity when registering a service
into Consul.

Fixes #15902
2023-01-27 18:15:51 -06:00
Tim Gross
e53b591582 metrics: Add remaining server RPC rate metrics (#15901) 2023-01-27 08:29:53 -05:00
Piotr Kazmierczak
0abadb6804 acl: make auth method default across all types (#15869) 2023-01-26 14:17:11 +01:00
James Rasell
14fb036473 sso: allow binding rules to create management ACL tokens. (#15860)
* sso: allow binding rules to create management ACL tokens.

* docs: update binding rule docs to detail management type addition.
2023-01-26 09:57:44 +01:00
Tim Gross
bcd5bbdad7 add metric for count of RPC requests (#15515)
Implement a metric for RPC requests with labels on the identity, so that
administrators can monitor the source of requests within the cluster. This
changeset demonstrates the change with the new `ACL.WhoAmI` RPC, and we'll wire
up the remaining RPCs once we've threaded the new pre-forwarding authentication
through the all.

Note that metrics are measured after we forward but before we return any
authentication error. This ensures that we only emit metrics on the server that
actually serves the request. We'll perform rate limiting at the same place.

Includes telemetry configuration to omit identity labels.
2023-01-24 11:54:20 -05:00
Tim Gross
b79d00abf3 implement pre-forwarding auth on select RPCs (#15513)
In #15417 we added a new `Authenticate` method to the server that returns an
`AuthenticatedIdentity` struct. This changeset implements this method for a
small number of RPC endpoints that together represent all the various ways in
which RPCs are sent, so that we can validate that we're happy with this
approach.
2023-01-24 10:52:07 -05:00
Karl Johann Schubert
588392cabc client: add disk_total_mb and disk_free_mb config options (#15852) 2023-01-24 09:14:22 -05:00
Charlie Voiselle
85f67d4a83 Add raft snapshot configuration options (#15522)
* Add config elements
* Wire in snapshot configuration to raft
* Add hot reload of raft config
* Add documentation for new raft settings
* Add changelog
2023-01-20 14:21:51 -05:00
James Rasell
6a8728d00a cli: use localhost for default login callback address. (#15820) 2023-01-19 16:46:17 +01:00
James Rasell
859cb6e3fb Merge branch 'main' into sso/gh-13120-oidc-login 2023-01-18 10:05:31 +00:00
Phil Renaud
d57b805780 [sso] OIDC Updates for the UI (#15804)
* Updated UI to handle OIDC method changes

* Remove redundant store unload call
2023-01-17 17:01:47 -05:00
Dao Thanh Tung
af56eb8b7f fix bug in nomad fmt -check does not return error code (#15797) 2023-01-17 09:15:34 -05:00
James Rasell
531bada034 cli: add login command to allow OIDC provider SSO login. 2023-01-13 13:16:09 +00:00
James Rasell
0279d95b55 api: add OIDC HTTP API endpoints and SDK. 2023-01-13 13:15:58 +00:00
Seth Hoenig
4698d8da79 consul/connect: support for proxy upstreams opaque config (#15761)
This PR adds support for configuring `proxy.upstreams[].config` for
Consul Connect upstreams. This is an opaque config value to Nomad -
the data is passed directly to Consul and is unknown to Nomad.
2023-01-12 08:20:54 -06:00
Anthony Davis
abe088954e Fix rejoin_after_leave behavior (#15552) 2023-01-11 16:39:24 -05:00
Dao Thanh Tung
30b235345d cli: Add a nomad operator client state command (#15469)
Signed-off-by: dttung2905 <ttdao.2015@accountancy.smu.edu.sg>
2023-01-11 10:03:31 -05:00
Dao Thanh Tung
f89ac80801 agent: Make agent syslog log level inherit from Nomad agent log (#15625) 2023-01-04 09:38:06 -05:00
Tim Gross
e23b3a350e csi: Fix parsing of '=' in secrets at command line and HTTP (#15670)
The command line flag parsing and the HTTP header parsing for CSI secrets
incorrectly split at more than one '=' rune, making it impossible to use secrets
that included that rune.
2023-01-03 16:28:38 -05:00
Seth Hoenig
dab4d7ed7a ci: swap freeport for portal in packages (#15661) 2023-01-03 11:25:20 -06:00
Seth Hoenig
e2f912046b command: fixup parsing of stale query parameter (#15631)
In #15605 we fixed the bug where the presense of "stale" query parameter
was mean to imply stale, even if the value of the parameter was "false"
or malformed. In parsing, we missed the case where the slice of values
would be nil which lead to a failing test case that was missed because
CI didn't run against the original PR.
2023-01-03 08:21:20 -06:00
Seth Hoenig
2609f6a137 cleanup: remove usage of consul/sdk/testutil/retry (#15609)
This PR removes usages of `consul/sdk/testutil/retry`, as part of the
ongoing effort to remove use of any non-API module from Consul.

There is one remanining usage in the helper/freeport package, but that
will get removed as part of #15589
2023-01-02 08:06:20 -06:00
Dao Thanh Tung
1584496d96 fix: stale querystring parameter value as boolean (#15605)
* Add changes to make stale querystring param boolean

Signed-off-by: dttung2905 <ttdao.2015@accountancy.smu.edu.sg>

* Make error message more consistent

Signed-off-by: dttung2905 <ttdao.2015@accountancy.smu.edu.sg>

* Changes from code review + Adding CHANGELOG file

Signed-off-by: dttung2905 <ttdao.2015@accountancy.smu.edu.sg>

* Changes from code review to use github.com/shoenig/test package

Signed-off-by: dttung2905 <ttdao.2015@accountancy.smu.edu.sg>

* Change must.Nil() to must.NoError()

Signed-off-by: dttung2905 <ttdao.2015@accountancy.smu.edu.sg>

* Minor fix on the import order

Signed-off-by: dttung2905 <ttdao.2015@accountancy.smu.edu.sg>

* Fix existing code format too

Signed-off-by: dttung2905 <ttdao.2015@accountancy.smu.edu.sg>

* Minor changes addressing code review feedbacks

Signed-off-by: dttung2905 <ttdao.2015@accountancy.smu.edu.sg>

* swap must.EqOp() order of param provided

Signed-off-by: dttung2905 <ttdao.2015@accountancy.smu.edu.sg>

Signed-off-by: dttung2905 <ttdao.2015@accountancy.smu.edu.sg>
2023-01-01 13:04:14 -06:00
Seth Hoenig
5380a944ad command: fixup tests concerning multi job stop (#15606)
* command: fixup job multi-stop test

This PR refactors the StopCommand test that runs 10 jobs and then
passes them all to one invokation of 'job stop'.

* test: swap use of assert for must

* test: cleanup job files we create

* command: fixup job stop failure tests

Now that JobStop works on concurrent jobs, the error messages are
different.

* cleanup: use multiple post scripts
2022-12-21 16:21:48 -06:00
Seth Hoenig
3bb144c43f tests: do not return error from testagent shutdown (#15595) 2022-12-21 08:23:58 -06:00
Danish Prakash
16401b864e command/job_stop: accept multiple jobs, stop concurrently (#12582)
* command/job_stop: accept multiple jobs, stop concurrently

Signed-off-by: danishprakash <grafitykoncept@gmail.com>

* command/job_stop_test: add test for multiple job stops

Signed-off-by: danishprakash <grafitykoncept@gmail.com>

* improve output, add changelog and docs

Signed-off-by: danishprakash <grafitykoncept@gmail.com>
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
2022-12-16 15:46:58 -08:00
James Rasell
13c1275057 cli: add ACL binding rule commands for CRUD actions. (#15554) 2022-12-15 16:57:44 +01:00
James Rasell
4d60dd3dbb ACL: add ACL binding rule RPC and HTTP API handlers. (#15529)
This change add the RPC ACL binding rule handlers. These handlers
are responsible for the creation, updating, reading, and deletion
of binding rules.

The write handlers are feature gated so that they can only be used
when all federated servers are running the required version.

The HTTP API handlers and API SDK have also been added where
required. This allows the endpoints to be called from the API by users
and clients.
2022-12-15 09:18:55 +01:00
Piotr Kazmierczak
627debc14b acl: numerous small bugfixes for acl auth methods CLI (#15539)
This PR contains a number of small bugfixes discovered during #15538 work.
2022-12-14 13:25:40 +01:00
Piotr Kazmierczak
9dbe34ac05 bugfix: acl sso auth methods test failures (#15512)
This PR fixes unit test failures introduced in f4e89e2
2022-12-09 18:47:32 +01:00
Piotr Kazmierczak
7ee82dc21a acl: added type to ACL Auth Method stub (#15480) 2022-12-06 14:47:05 +01:00
Piotr Kazmierczak
2c196c3de4 bugfix: corrected indentation for ACL auth method create CLI command (#15481) 2022-12-06 14:45:24 +01:00
Seth Hoenig
86479386f8 consul: fixup expected consul tagged_addresses when using ipv6 (#15411)
This PR is a continuation of #14917, where we missed the ipv6 cases.

Consul auto-inserts tagged_addresses for keys
- lan_ipv4
- wan_ipv4
- lan_ipv6
- wan_ipv6

even though the service registration coming from Nomad does not contain such
elements. When doing the differential between services Nomad expects to be
registered vs. the services actually registered into Consul, we must first
purge these automatically inserted tagged_addresses if they do not exist in
the Nomad view of the Consul service.
2022-12-01 07:38:30 -06:00
Piotr Kazmierczak
fe1ff602f8 acl: sso auth methods RPC/API/CLI should return created or updated objects (#15410)
Currently CRUD code that operates on SSO auth methods does not return created or updated object upon creation/update. This is bad UX and inconsistent behavior compared to other ACL objects like roles, policies or tokens.

This PR fixes it.

Relates to #13120
2022-11-29 07:36:36 +01:00
Piotr Kazmierczak
2d83ce7b15 acl: sso auth methods cli commands (#15322)
This PR implements CLI commands to interact with SSO auth methods.

This PR is part of the SSO work captured under ☂️ ticket #13120.
2022-11-28 10:51:45 +01:00
Piotr Kazmierczak
ecd454e15d bugfix: typos in acl role commands (#15382)
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
2022-11-25 10:28:33 +01:00
Luiz Aoqui
d439fe0d90 cli: improve errors for multiregion deployments (#15326)
Co-authored-by: Piotr Kazmierczak <470696+pkazmierczak@users.noreply.github.com>
2022-11-23 16:40:13 -05:00
Jack
66c61e4fd2 cli: wait flag for use with deployment status -monitor (#15262) 2022-11-23 16:36:13 -05:00
James Rasell
84b79aa87d sso: add ACL auth-method HTTP API CRUD endpoints (#15338)
* core: remove custom auth-method TTLS and use ACL token TTLS.

* agent: add ACL auth-method HTTP endpoints for CRUD actions.

* api: add ACL auth-method client.
2022-11-23 09:38:02 +01:00
Lance Haig
8667dc2607 Add command "nomad tls" (#14296) 2022-11-22 14:12:07 -05:00
hc-github-team-nomad-core
87f3565fc6 Generate files for 1.4.3 release 2022-11-22 12:56:29 -05:00
Seth Hoenig
3b14db4b83 consul: add trace logging around service registrations (#15311)
This PR adds trace logging around the differential done between a Nomad service
registration and its corresponding Consul service registration, in an effort
to shed light on why a service registration request is being made.
2022-11-21 08:03:56 -06:00
James Rasell
faabc2b2c2 api: ensure ACL role upsert decode error returns a 400 status code. (#15253) 2022-11-18 17:47:43 +01:00
James Rasell
c495cd99bf api: ensure all request body decode error return a 400 status code. (#15252) 2022-11-18 17:04:33 +01:00
James Rasell
a3f3018227 agent: ensure all HTTP Server methods are pointer receivers. (#15250) 2022-11-15 16:31:44 +01:00
Tim Gross
65b3d01aab eval delete: move batching of deletes into RPC handler and state (#15117)
During unusual outage recovery scenarios on large clusters, a backlog of
millions of evaluations can appear. In these cases, the `eval delete` command can
put excessive load on the cluster by listing large sets of evals to extract the
IDs and then sending larges batches of IDs. Although the command's batch size
was carefully tuned, we still need to be JSON deserialize, re-serialize to
MessagePack, send the log entries through raft, and get the FSM applied.

To improve performance of this recovery case, move the batching process into the
RPC handler and the state store. The design here is a little weird, so let's
look a the failed options first:

* A naive solution here would be to just send the filter as the raft request and
  let the FSM apply delete the whole set in a single operation. Benchmarking with
  1M evals on a 3 node cluster demonstrated this can block the FSM apply for
  several minutes, which puts the cluster at risk if there's a leadership
  failover (the barrier write can't be made while this apply is in-flight).

* A less naive but still bad solution would be to have the RPC handler filter
  and paginate, and then hand a list of IDs to the existing raft log
  entry. Benchmarks showed this blocked the FSM apply for 20-30s at a time and
  took roughly an hour to complete.

Instead, we're filtering and paginating in the RPC handler to find a page token,
and then passing both the filter and page token in the raft log. The FSM apply
recreates the paginator using the filter and page token to get roughly the same
page of evaluations, which it then deletes. The pagination process is fairly
cheap (only abut 5% of the total FSM apply time), so counter-intuitively this
rework ends up being much faster. A benchmark of 1M evaluations showed this
blocked the FSM apply for 20-30ms at a time (typical for normal operations) and
completes in less than 4 minutes.

Note that, as with the existing design, this delete is not consistent: a new
evaluation inserted "behind" the cursor of the pagination will fail to be
deleted.
2022-11-14 14:08:13 -05:00