Commit Graph

592 Commits

Author SHA1 Message Date
Michael Schurter
542b23e999 Accept Workload Identities for Client RPCs (#16254)
This change resolves policies for workload identities when calling Client RPCs. Previously only ACL tokens could be used for Client RPCs.

Since the same cache is used for both bearer tokens (ACL and Workload ID), the token cache size was doubled.

---------

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
2023-02-27 10:17:47 -08:00
Michael Schurter
d5f0db8a5e Task API / Dynamic Node Metadata E2E test fixes (#16219)
* taskapi: return Forbidden on bad credentials

Prior to this change a "Server error" would be returned when ACLs are
enabled which did not match when ACLs are disabled.

* e2e: love love love datacenter wildcard default

* e2e: skip windows nodes on linux only test

The Logfs are a bit weird because they're most useful when converted to
Printfs to make debugging the test much faster, but that makes CI noisy.

In a perfect world Go would expose how many tests are being run and we
could stream output live if there's only 1. For now I left these helpful
lines in as basically glorified comments.
2023-02-21 10:53:10 -08:00
Tim Gross
517ad9c5bf E2E: add multi-home networking to test infrastructure (#16218)
Add an Elastic Network Interface (ENI) to each Linux host, on a secondary subnet
we have provisioned in each AZ. Revise security groups as follows:

* Split out client security groups from servers so that we can't have clients
  accidentally accessing serf addresses or other unexpected cross-talk.
* Add new security groups for the secondary subnet that only allows
  communication within the security group so we can exercise behaviors with
  multiple IPs.

This changeset doesn't include any Nomad configuration changes needed to take
advantage of the extra network interface. I'll include those with testing for
PR #16217.
2023-02-20 10:08:28 +01:00
Seth Hoenig
511d0c1e70 artifact: protect against unbounded artifact decompression (1.5.0) (#16151)
* artifact: protect against unbounded artifact decompression

Starting with 1.5.0, set defaut values for artifact decompression limits.

artifact.decompression_size_limit (default "100GB") - the maximum amount of
data that will be decompressed before triggering an error and cancelling
the operation

artifact.decompression_file_count_limit (default 4096) - the maximum number
of files that will be decompressed before triggering an error and
cancelling the operation.

* artifact: assert limits cannot be nil in validation
2023-02-14 09:28:39 -06:00
Michael Schurter
6809b0b527 Dynamic Node Metadata (#15844)
Fixes #14617
Dynamic Node Metadata allows Nomad users, and their jobs, to update Node metadata through an API. Currently Node metadata is only reloaded when a Client agent is restarted.

Includes new UI for editing metadata as well.

---------

Co-authored-by: Phil Renaud <phil.renaud@hashicorp.com>
2023-02-07 14:42:25 -08:00
Seth Hoenig
c4957b0901 e2e: mark framework package as deprecated (#16075)
Nothing more motivating than lots of deprecation warnings
to get some code refactored.
2023-02-07 08:10:40 -06:00
Michael Schurter
9bab96ebd3 Task API via Unix Domain Socket (#15864)
This change introduces the Task API: a portable way for tasks to access Nomad's HTTP API. This particular implementation uses a Unix Domain Socket and, unlike the agent's HTTP API, always requires authentication even if ACLs are disabled.

This PR contains the core feature and tests but followup work is required for the following TODO items:

- Docs - might do in a followup since dynamic node metadata / task api / workload id all need to interlink
- Unit tests for auth middleware
- Caching for auth middleware
- Rate limiting on negative lookups for auth middleware

---------

Co-authored-by: Seth Hoenig <shoenig@duck.com>
2023-02-06 11:31:22 -08:00
Charlie Voiselle
fe4ff5be2a Add option to expose workload token to task (#15755)
Add `identity` jobspec block to expose workload identity tokens to tasks.

---------

Co-authored-by: Anders <mail@anars.dk>
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
2023-02-02 10:59:14 -08:00
Seth Hoenig
d881b23370 bootstrap: upgrade golangci-lint in prep for go1.20 (#16024)
This PR updates golangci-lint to work better with go1.20 - the previous
version would cause in oom on 'make check'.
2023-02-02 09:44:12 -06:00
Seth Hoenig
fcc6cfa849 consul: restore consul token when reverting a job (#15996)
* consul: reset consul token on job during registration of a reversion

* e2e: add test for reverting a job with a consul service

* cl: fixup cl entry
2023-02-01 14:02:45 -06:00
Seth Hoenig
d375f6043f e2e: remove unused consulacls directory (#15995)
This pile was deprecated when we starting using HCP Consul for e2e
instead of standing up our own cluster and managing Consuls at test
runtime.
2023-01-31 16:03:47 -06:00
Piotr Kazmierczak
949a6f60c7 renamed stanza to block for consistency with other projects (#15941) 2023-01-30 15:48:43 +01:00
Seth Hoenig
2470bf9128 nsd: block on removal of services (#15862)
* nsd: block on removal of services

This PR uses a WaitGroup to ensure workload removals are complete
before returning from ServiceRegistrationHandler.RemoveWorkload of
the nomad service provider. The de-registration of individual services
still occurs asynchrously, but we must block on the parent removal
call so that we do not race with further operations on the same set
of services - e.g. in the case of a task restart where we de-register
and then re-register the services in quick succession.

Fixes #15032

* nsd: add e2e test for initial failing check and restart
2023-01-26 08:17:57 -06:00
Seth Hoenig
4a9b0945b1 e2e: fixup reference to exported test type (#15786) 2023-01-17 12:13:57 -06:00
Seth Hoenig
f05aa6d5ec vault: configure user agent on Nomad vault clients (#15745)
* vault: configure user agent on Nomad vault clients

This PR attempts to set the User-Agent header on each Vault API client
created by Nomad. Still need to figure a way to set User-Agent on the
Vault client created internally by consul-template.

* vault: fixup find-and-replace gone awry
2023-01-10 10:39:45 -06:00
Seth Hoenig
865ee8d37c artifact: fix sandbox behavior when destination is shared alloc directory (#15712)
This PR fixes the artifact sandbox (new in Nomad 1.5) to allow downloading
artifacts into the shared 'alloc' directory made available to each task in
a common allocation. Previously we assumed the 'alloc' dir would be mounted
under the 'task' dir, but this is only the case in fs isolation: chroot; in
other modes the alloc dir is elsewhere.
2023-01-09 09:46:32 -06:00
Seth Hoenig
9db2d8a3fb e2e: fixup windows artifact download test cases (#15710)
- fix wrong task name for one case
- comment out git windows test (still need to setup git on e2e windows client)
2023-01-06 12:38:48 -06:00
Seth Hoenig
ab5ff3c651 e2e: disable disconnected clients test(s) (#15703)
The e2e suite is not in good shape right now; let's disable the tests that modify
agent / node state until we can get things working again. Also the one DC test
that was enabled still doesn't work anyway.
2023-01-06 08:52:37 -06:00
Seth Hoenig
cfc67c3422 client: sandbox go-getter subprocess with landlock (#15328)
* client: sandbox go-getter subprocess with landlock

This PR re-implements the getter package for artifact downloads as a subprocess.

Key changes include

On all platforms, run getter as a child process of the Nomad agent.
On Linux platforms running as root, run the child process as the nobody user.
On supporting Linux kernels, uses landlock for filesystem isolation (via go-landlock).
On all platforms, restrict environment variables of the child process to a static set.
notably TMP/TEMP now points within the allocation's task directory
kernel.landlock attribute is fingerprinted (version number or unavailable)
These changes make Nomad client more resilient against a faulty go-getter implementation that may panic, and more secure against bad actors attempting to use artifact downloads as a privilege escalation vector.

Adds new e2e/artifact suite for ensuring artifact downloading works.

TODO: Windows git test (need to modify the image, etc... followup PR)

* landlock: fixup items from cr

* cr: fixup tests and go.mod file
2022-12-07 16:02:25 -06:00
Seth Hoenig
6e4410a9b1 e2e: fix 1 of 4 client disconnect tests (#15357)
This PR modifies the disconnect helper job to run as root, which is necesary
for manipulating iptables as it does. Also re-organizes the final test logic
to wait for client re-connect before looking for the replacement (3rd) allocation
in case that client was needed to run the alloc (also giving the sheduler more
time to do its thing).

Skips the other 3 tests, which fail and I cannot yet figure out what is going on.
2022-11-22 08:51:53 -06:00
Seth Hoenig
2372c6d20c e2e: fixup oversubscription test case for jammy (#15347)
* e2e: fixup oversubscription test case for jammy

jammy uses cgroups v2, need to lookup the max memory limit from the
unified heirarchy format

* e2e: set constraint to require cgroups v2 on oversub docker test
2022-11-21 12:41:55 -06:00
Seth Hoenig
78593daaee e2e: jammy image needs latest java lts (#15323) 2022-11-18 14:36:36 -06:00
Seth Hoenig
7c254ccdb8 e2e: disable systemd stub dns in jammy image (#15286) 2022-11-17 09:50:44 -06:00
Seth Hoenig
0e3606afa0 e2e: swap bionic image for jammy (#15220) 2022-11-16 10:37:18 -06:00
Derek Strickland
7e8306e476 api: remove mapstructure tags fromPort struct (#12916)
This PR solves a defect in the deserialization of api.Port structs when returning structs from theEventStream.

Previously, the api.Port struct's fields were decorated with both mapstructure and hcl tags to support the network.port stanza's use of the keyword static when posting a static port value. This works fine when posting a job and when retrieving any struct that has an embedded api.Port instance as long as the value is deserialized using JSON decoding. The EventStream, however, uses mapstructure to decode event payloads in the api package. mapstructure expects an underlying field named static which does not exist. The result was that the Port.Value field would always be set to 0.

Upon further inspection, a few things became apparent.

The struct already has hcl tags that support the indirection during job submission.
Serialization/deserialization with both the json and hcl packages produce the desired result.
The use of of the mapstructure tags provided no value as the Port struct contains only fields with primitive types.
This PR:

Removes the mapstructure tags from the api.Port structs
Updates the job parsing logic to use hcl instead of mapstructure when decoding Port instances.
Closes #11044

Co-authored-by: DerekStrickland <dstrickland@hashicorp.com>
Co-authored-by: Piotr Kazmierczak <470696+pkazmierczak@users.noreply.github.com>
2022-11-08 11:26:28 +01:00
Seth Hoenig
3c17552d4d e2e: explicitly wait on task status in chroot download exec test (#15145)
Also add some debug log lines for this test, because it doesn't make sense
for the allocation to be complete yet a task in the allocation to be not
started yet, which is what the test failures are implying.
2022-11-04 09:50:11 -05:00
Michael Schurter
5ed74049e0 test: use port collision instead of cpu exhaustion (#14994)
Originally this test relied on Job 1 blocking Job 2 until Job 1 had a
terminal *ClientStatus.* Job 2 ensured it would get blocked using 2
mechanisms:

1. A constraint requiring it is placed on the same node as Job 1.
2. Job 2 would require all unreserved CPU on the node to ensure it would
   be blocked until Job 1's resources were free.

That 2nd assertion breaks if *any previous job is still running on the
target node!* That seems very likely to happen in the flaky world of our
e2e tests. In fact there may be some jobs we intentionally want running
throughout; in hindsight it was never safe to assume my test would be
the only thing scheduled when it ran.

*Ports to the rescue!* Reserving a static port means that both Job 2
will now block on Job 1 being terminal. It will only conflict with other
tests if those tests use that port *on every node.* I ensured no
existing tests were using the port I chose.

Other changes:
- Gave job a bit more breathing room resource-wise.
- Tightened timings a bit since previous failure ran into the `go test`
  time limit.
- Cleaned up the DumpEvals output. It's quite nice and handy now!
2022-10-21 07:53:26 -07:00
Seth Hoenig
0b69a52a40 e2e: convert flaky exec download in chroot unit test into e2e test (#14949)
Similar to https://github.com/hashicorp/nomad/pull/14710, convert flaky
test into e2e test.
2022-10-19 08:22:32 -05:00
Michael Schurter
7b360de940 test: expand timing and debugging for overlap test (#14920)
attempt #9000
2022-10-18 13:02:18 -07:00
Michael Schurter
9cdbbbb583 test: extend timing and output of overlap e2e test (#14894)
Keeps failing in the nightly e2e test with unhelpful output like:
```
Failed
=== RUN   TestOverlap
    overlap_test.go:92: Followup job overlap93ee1d2b blocked. Sleeping for the rest of overlap48c26c39's shutdown_delay (9.2/10s)
    overlap_test.go:105: 1500/2000 retries reached for github.com/hashicorp/nomad/e2e/overlap.TestOverlap (err=timed out before an allocation was found for overlap93ee1d2b)
    overlap_test.go:105: timeout: timed out before an allocation was found for overlap93ee1d2b
--- FAIL: TestOverlap (38.96s)
```

I have not been able to replicate it in my own e2e cluster, so I added
the EvalDump helper to add detailed eval information like:

```
=== RUN   TestOverlap
1/1 Job overlap7b0e90ec Eval c38c9919-a4f0-5baf-45f7-0702383c682a
  Type:         service
  TriggeredBy:  job-register
  Deployment:
  Status:       pending ()
  NextEval:
  PrevEval:
  BlockedEval:
   -- No placement failures --
  QueuedAllocs:
  SnapshotIdx:  0
  CreateIndex:  96
  ModifyIndex:  96

...
```

Hopefully helpful when debugging other tests as well!
2022-10-14 14:15:07 -07:00
Michael Schurter
3282a9c0f7 test: simplify overlap job placement logic (#14811)
* test: simplify overlap job placement logic

Trying to fix #14806

Both the previous approach as well as this one worked on e2e clusters I
spun up.

* simplify code flow
2022-10-12 11:21:28 -07:00
Giovani Avelar
2b9158b73e Allow specification of a custom job name/prefix for parameterized jobs (#14631) 2022-10-06 16:21:40 -04:00
James Rasell
a0b9af4165 e2e: fixes the ordering on greater than checks within spread test. (#14818) 2022-10-06 15:27:36 +02:00
James Rasell
8867791818 e2e: fix incorrect must function usage in namespace suite. (#14805) 2022-10-05 15:50:56 +02:00
Michael Schurter
617f223242 Fixing flaky TestOverlap test (#14780)
* test: ensure feasible node selected in overlap test

* test: warn when getting close to retry limit
2022-10-03 14:35:02 -07:00
Seth Hoenig
32be86831f e2e: convert chroot env unit tests into e2e tests (#14710)
This PR translates two of our most flakey unit tests into
e2e tests where they are fit much more naturally.
2022-09-26 15:40:29 -05:00
Michael Schurter
62d30c06a1 test: add e2e for non-overlapping placements (#14646)
* test: add e2e for non-overlapping placements

Followup to #10446

Fails (as expected) against 1.3.x at the wait for blocked eval (because
the allocs are allowed to overlap).

Passes against 1.4.0-beta.1 (as expected).

* Update e2e/overlap/overlap_test.go

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
2022-09-22 13:06:17 -07:00
Seth Hoenig
ff1a30fe8d cleanup more helper updates (#14638)
* cleanup: refactor MapStringStringSliceValueSet to be cleaner

* cleanup: replace SliceStringToSet with actual set

* cleanup: replace SliceStringSubset with real set

* cleanup: replace SliceStringContains with slices.Contains

* cleanup: remove unused function SliceStringHasPrefix

* cleanup: fixup StringHasPrefixInSlice doc string

* cleanup: refactor SliceSetDisjoint to use real set

* cleanup: replace CompareSliceSetString with SliceSetEq

* cleanup: replace CompareMapStringString with maps.Equal

* cleanup: replace CopyMapStringString with CopyMap

* cleanup: replace CopyMapStringInterface with CopyMap

* cleanup: fixup more CopyMapStringString and CopyMapStringInt

* cleanup: replace CopySliceString with slices.Clone

* cleanup: remove unused CopySliceInt

* cleanup: refactor CopyMapStringSliceString to be generic as CopyMapOfSlice

* cleanup: replace CopyMap with maps.Clone

* cleanup: run go mod tidy
2022-09-21 14:53:25 -05:00
James Rasell
a801e109b8 e2e: use unique names for Connect ACL Consul policy names. (#14604)
In the event a single test fails to clear up properly after
itself, all other tests will fail as they attempt to create ACL
policies with the same names. This change ensures they use unique
ACL names, so when a single test fails, it is easy to identify it
is a problem with the test rather than the suite.
2022-09-16 13:35:40 +02:00
James Rasell
55e9e8298c e2e: rewrite spread suite to use new e2e style. (#14598)
The rewrite refactors the suite to use the new style along with
other recent testing improvements. In order to ensure the spread
tests do not impact each other, there is new cleanup functionality
to ensure both the job and allocations are removed from state
before the test exits completely.
2022-09-15 17:12:20 +02:00
James Rasell
ea56e11e96 e2e: do not assume clean cluster when checking return objects. (#14557) 2022-09-13 14:25:19 +02:00
James Rasell
23dd15819a acl: fix encoding expiration time in ACL token list API. (#14542) 2022-09-12 15:50:35 +02:00
James Rasell
1db203f416 e2e: fixup service discovery and ACL expiration tests. (#14517)
The NSD checks tests were racey, whereby the check may not have
been triggered by the time it was queried. This change wraps the
check so it can account for this.

This removes the current ACL expiration GC section in order to get
the tests passing and allow more time to investigate the test. I
have full confidence the feature is working as expected and have
tested extensively locally.
2022-09-09 14:27:40 +02:00
James Rasell
22403d357c e2e: fixup token expiration test to account for longer forced GC. (#14491) 2022-09-08 14:43:04 +02:00
James Rasell
c9b6d0e71b e2e: add test to exercise ACL tokens with role and policy links. (#14432) 2022-09-02 08:56:00 +02:00
James Rasell
785b4dfad7 e2e: add acl test for token expiration. (#14418)
In order to add an E2E test to cover token expiration, the server
config has been updated to include a low minimum allowed TTL
value. For ease of reading, the max value is also set.
2022-09-01 09:36:09 +02:00
James Rasell
91e7d8497b e2e: add ACL test suite with ACL Role test. (#14398)
This adds a new ACL test suite to the e2e framework which includes
an initial test for ACL roles. The ACL test includes a helper to
track and clean created Nomad resources which keeps the test
cluster clean no matter if the test fails early or not.
2022-08-31 10:11:28 +02:00
Seth Hoenig
f8beb53471 e2e: add e2e tests for nomad service disco checks
This PR adds 2 e2e tests for ensuring nomad service discovery checks
get created and produce status results as expected.
2022-08-22 15:31:13 -05:00
Piotr Kazmierczak
c4be2c6078 cleanup: replace TypeToPtr helper methods with pointer.Of (#14151)
Bumping compile time requirement to go 1.18 allows us to simplify our pointer helper methods.
2022-08-17 18:26:34 +02:00
Seth Hoenig
0c62f445c3 build: run gofmt on all go source files
Go 1.19 will forecefully format all your doc strings. To get this
out of the way, here is one big commit with all the changes gofmt
wants to make.
2022-08-16 11:14:11 -05:00