Commit Graph

24817 Commits

Author SHA1 Message Date
Phil Renaud
d06dfd2abc Node Pools moved to after Type in jobs index columns (#17738) 2023-06-26 17:00:01 -04:00
Seth Hoenig
ec4fa55bbf drivers/docker: refactor use of clients in docker driver (#17731)
* drivers/docker: refactor use of clients in docker driver

This PR refactors how we manage the two underlying clients used by the
docker driver for communicating with the docker daemon. We keep two clients
- one with a hard-coded timeout that applies to all operations no matter
what, intended for use with short lived / async calls to docker. The other
has no timeout and is the responsibility of the caller to set a context
that will ensure the call eventually terminates.

The use of these two clients has been confusing and mistakes were made
in a number of places where calls were making use of the wrong client.

This PR makes it so that a user must explicitly call a function to get
the client that makes sense for that use case.

Fixes #17023

* cr: followup items
2023-06-26 15:21:42 -05:00
sejalapeno
05c84d64d2 Update allocations.go (#17726)
* Update allocations.go

updated missing client status "unknown" #17688

* changelog

* Update .changelog/17726.txt

adding relevant desc.

Co-authored-by: Seth Hoenig <shoenig@duck.com>

---------

Co-authored-by: Seth Hoenig <shoenig@duck.com>
2023-06-26 13:33:29 -05:00
nicoche
a9135bc6d5 deploymentwatcher: fail early whenever possible (#17341)
Given a deployment that has a `progress_deadline`, if a task group runs
out of reschedule attempts, allow it to fail at this time instead of
waiting until the `progress_deadline` is reached.

Fixes: #17260
2023-06-26 14:01:03 -04:00
Phil Renaud
d20faf5855 [ui] alignment and spacing for job status panel (#17708)
* CSS alignment and spacing for job status panel

* Only fade the count, not the legend icon, when count is 0

* Unrounded version corners

* changelog

* css has to only remove border radius when count is present

* Seed stabilization for services test

* Try consolidating the testfixes from before

* Total test isolation and bonus logs

* Drop the isolation but keep the logs

* Remove bonus logging
2023-06-26 12:18:12 -04:00
hashicorp-copywrite[bot]
d778ecfc7d [COMPLIANCE] Add Copyright and License Headers (#17732)
Co-authored-by: hashicorp-copywrite[bot] <110428419+hashicorp-copywrite[bot]@users.noreply.github.com>
2023-06-26 11:11:17 -05:00
dependabot[bot]
f4385cc817 build(deps): bump github.com/containerd/go-cni from 1.1.7 to 1.1.9 (#17582) 2023-06-26 16:47:20 +01:00
James Rasell
09cc4b12bc test: add drain config tests. (#17724) 2023-06-26 16:23:13 +01:00
Seth Hoenig
37df529e7a e2e: refactor pids isolation tests (#17717)
This PR refactors some old PID isolation tests to make use of the e2e/v3
packages. Should be quite a bit easier to read. Adds 'alloc exec' capability
to the jobs3 package.
2023-06-26 09:51:18 -05:00
Tim Gross
78f4f76520 adjust prioritized client updates (#17541)
In #17354 we made client updates prioritized to reduce client-to-server
traffic. When the client has no previously-acknowledged update we assume that
the update is of typical priority; although we don't know that for sure in
practice an allocation will never become healthy quickly enough that the first
update we send is the update saying the alloc is healthy.

But that doesn't account for allocations that quickly fail in an unrecoverable
way because of allocrunner hook failures, and it'd be nice to be able to send
those failure states to the server more quickly. This changeset does so and adds
some extra comments on reasoning behind priority.
2023-06-26 09:14:24 -04:00
dependabot[bot]
1d4f8869fd build(deps): bump github.com/opencontainers/runtime-spec (#17719)
Bumps [github.com/opencontainers/runtime-spec](https://github.com/opencontainers/runtime-spec) from 1.0.3-0.20210326190908-1c3f411f0417 to 1.1.0-rc.3.
- [Release notes](https://github.com/opencontainers/runtime-spec/releases)
- [Changelog](https://github.com/opencontainers/runtime-spec/blob/main/ChangeLog)
- [Commits](https://github.com/opencontainers/runtime-spec/commits/v1.1.0-rc.3)

---
updated-dependencies:
- dependency-name: github.com/opencontainers/runtime-spec
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-06-26 08:03:50 -05:00
Piotr Kazmierczak
807907e001 chore: gofmt docker driver handle.go (#17721) 2023-06-26 10:38:23 +02:00
Johan Forssell
5b46b74b94 drivers: OOM kill logging for Docker driver (#17518)
Explicit error log of the docker ID and container image name
2023-06-26 10:13:23 +02:00
Tim Gross
555214199a cli: fix broken node pool jobs test (#17715)
In #17705 we fixed a bug in the treatment of the "all" node pool for the `node
pool jobs` command but missed a test in the CLI.
2023-06-23 14:10:45 -07:00
Tim Gross
fc611fc5f4 docs: clarify drain's -force flag behavior with system/CSI jobs (#17703)
If you use `nomad node drain -force`, the drain deadline is set to -1ns. If you
have not prevented system and CSI node plugin allocations from being drained
with `-ignore-system`, they will be immediately drained as well. This is
typically not safe for CSI node plugins.

Also fix some broken links.

Fixes: #17696
2023-06-23 16:38:11 -04:00
Luiz Aoqui
276c69bffd api: prevent panic on job plan (#17689)
Check for a nil job ID to prevent a panic when calling Jobs().Plan().
2023-06-23 16:20:52 -04:00
Luiz Aoqui
b7c2d65a0e build: add Docker image (#17017)
Co-authored-by: Daniel Kimsey <90741+dekimsey@users.noreply.github.com>
2023-06-23 15:57:09 -04:00
Luiz Aoqui
aea6146656 np: fix list of jobs for node pool all (#17705)
Unlike nodes, jobs are allowed to be registered in the node pool `all`,
in which case all nodes are used for evaluating placements. When listing
jobs for the `all` node pool only those that are explicitly in this node
pool should be returned.
2023-06-23 15:47:53 -04:00
Luiz Aoqui
6b70896480 changelog: add entry for node pools (#17707) 2023-06-23 15:47:35 -04:00
Tim Gross
8dca3632ff docs: split out unsupported versions in changelog (#17704)
Our changelog has become large enough that GitHub's rendering is very slow,
resulting in error pages ("angry unicorns"). Split out the older unsupported
versions of Nomad into their own file so that we only need to render the most
recent versions, while keeping the older versions relatively searchable by
having them in a single file.
2023-06-23 15:17:57 -04:00
grembo
6f04b91912 Add disable_file parameter to job's vault stanza (#13343)
This complements the `env` parameter, so that the operator can author
tasks that don't share their Vault token with the workload when using 
`image` filesystem isolation. As a result, more powerful tokens can be used 
in a job definition, allowing it to use template stanzas to issue all kinds of 
secrets (database secrets, Vault tokens with very specific policies, etc.), 
without sharing that issuing power with the task itself.

This is accomplished by creating a directory called `private` within
the task's working directory, which shares many properties of
the `secrets` directory (tmpfs where possible, not accessible by
`nomad alloc fs` or Nomad's web UI), but isn't mounted into/bound to the
container.

If the `disable_file` parameter is set to `false` (its default), the Vault token
is also written to the NOMAD_SECRETS_DIR, so the default behavior is
backwards compatible. Even if the operator never changes the default,
they will still benefit from the improved behavior of Nomad never reading
the token back in from that - potentially altered - location.
2023-06-23 15:15:04 -04:00
Michael Lange
ed5160514e Merge pull request #17691 from hashicorp/f/missing-chart-stories
[UI] Missing chart stories
2023-06-23 08:17:34 -07:00
James Rasell
ced6c5890e client: remove unused nsd check allocation result diff func (#17695) 2023-06-23 15:26:06 +01:00
Seth Hoenig
5b5fbc0881 e2e: create a v3/ set of packages for creating Nomad e2e tests (#17620)
* e2e: create a v3/ set of packages for creating Nomad e2e tests

This PR creates an experimental set of packages under `e2e/v3/` for crafting
Nomad e2e tests. Unlike previous generations, this is an attempt at providing
a way to create tests in a declarative (ish) pattern, with a focus on being
easy to use, easy to cleanup, and easy to debug.

@shoenig is just trying this out to see how it goes.

Lots of features need to be implemented.
Many more docs need to be written.
Breaking changes are to be expected.
There are known and unknown bugs.
No warranty.

Quick run of `example` with verbose logging.

```shell
➜ NOMAD_E2E_VERBOSE=1 go test -v
=== RUN   TestExample
=== RUN   TestExample/testSleep
    util3.go:25: register (service) job: "sleep-809"
    util3.go:25: checking eval: 9f0ae04d-7259-9333-3763-44d0592d03a1, status: pending
    util3.go:25: checking eval: 9f0ae04d-7259-9333-3763-44d0592d03a1, status: complete
    util3.go:25: checking deployment: a85ad2f8-269c-6620-d390-8eac7a9c397d, status: running
    util3.go:25: checking deployment: a85ad2f8-269c-6620-d390-8eac7a9c397d, status: running
    util3.go:25: checking deployment: a85ad2f8-269c-6620-d390-8eac7a9c397d, status: running
    util3.go:25: checking deployment: a85ad2f8-269c-6620-d390-8eac7a9c397d, status: running
    util3.go:25: checking deployment: a85ad2f8-269c-6620-d390-8eac7a9c397d, status: successful
    util3.go:25: deployment a85ad2f8-269c-6620-d390-8eac7a9c397d was a success
    util3.go:25: deregister job "sleep-809"
    util3.go:25: system gc
=== RUN   TestExample/testNamespace
    util3.go:25: apply namespace "example-291"
    util3.go:25: register (service) job: "sleep-967"
    util3.go:25: checking eval: a2a2303a-adf1-2621-042e-a9654292e569, status: pending
    util3.go:25: checking eval: a2a2303a-adf1-2621-042e-a9654292e569, status: complete
    util3.go:25: checking deployment: 3395e9a8-3ffc-8990-d5b8-cc0ce311f302, status: running
    util3.go:25: checking deployment: 3395e9a8-3ffc-8990-d5b8-cc0ce311f302, status: running
    util3.go:25: checking deployment: 3395e9a8-3ffc-8990-d5b8-cc0ce311f302, status: running
    util3.go:25: checking deployment: 3395e9a8-3ffc-8990-d5b8-cc0ce311f302, status: successful
    util3.go:25: deployment 3395e9a8-3ffc-8990-d5b8-cc0ce311f302 was a success
    util3.go:25: deregister job "sleep-967"
    util3.go:25: system gc
    util3.go:25: cleanup namespace "example-291"
=== RUN   TestExample/testEnv
    util3.go:25: register (batch) job: "env-582"
    util3.go:25: checking eval: 600f3bce-ea17-6d13-9d20-9d9eb2a784f7, status: pending
    util3.go:25: checking eval: 600f3bce-ea17-6d13-9d20-9d9eb2a784f7, status: complete
    util3.go:25: deregister job "env-582"
    util3.go:25: system gc
--- PASS: TestExample (10.08s)
    --- PASS: TestExample/testSleep (5.02s)
    --- PASS: TestExample/testNamespace (4.02s)
    --- PASS: TestExample/testEnv (1.03s)
PASS
ok      github.com/hashicorp/nomad/e2e/example  10.079s
```

* cluster3: use filter for kernel.name instead of filtering manually
2023-06-23 09:10:49 -05:00
James Rasell
6f7dc6a860 server: remove unused endpoints struct. (#17665) 2023-06-23 08:20:33 +01:00
Luiz Aoqui
2dec8baac1 ci: fix flaky UI test (#17676) 2023-06-22 23:07:36 -04:00
Michael Lange
2e2e396d26 TopoViz story that is sourced from Mirage
Unfortunately due to the split build nature of the ember app and
storybook it isn't possible to import mirage in the storybook context to
control scenarios via a knob :(
2023-06-22 16:55:36 -07:00
Michael Lange
e7952b6e25 Full TopoViz story 2023-06-22 16:55:25 -07:00
Michael Lange
ea713212bb TopoViz child component stories 2023-06-22 15:03:32 -07:00
Michael Lange
7e1516b436 Standard usage story 2023-06-22 13:53:21 -07:00
Michael Lange
b4a72b5641 Basic recommendation-chart story with knobs 2023-06-22 13:53:21 -07:00
Phil Renaud
908ce47c1b [ui] Versions added to deploying status panel (#17629)
* Versions added to deploying status panel

* Wrap the running and healthy title in a span

* Versions in the deployment UI next to titles

* Version count and label styles updated
2023-06-22 16:19:41 -04:00
Tim Gross
c3d81578f1 release pipeline: fix ref arguments in invoking workflow (#17684)
Although #17669 fixed the permissions of the release pipeline to push new
commits, there was still an error when invoking the `build` workflow.

The format of the reference was changed in #17103 such that we're sending the
git ref (a SHA) and not the "--ref" argument required by the GH actions workflow
API, which in this case is apparently specially defined as "The branch or tag
name which contains the version of the workflow file you'd like to run" and not
what git calls a "ref".

This changeset:
* Removes the third-party action entirely so that we're using GitHub's own
  tooling. This removes one more thing from the supply chain to pin and ensures a
  1:1 mapping of args to what's documented by GitHub.
* Removes the `--ref` argument entirely, which causes it to default to the
  current branch that the release workflow is running on (which is always what
  we want).
2023-06-22 15:33:19 -04:00
Phil Renaud
e617962251 [ui, deployments] job status panel legend: counts of 0 don't get links (#17644) 2023-06-22 14:40:11 -04:00
Luiz Aoqui
13ee343853 core: remove unnecessary call to SetNodes and adds DC downgrade test (#17655) 2023-06-22 13:26:14 -04:00
Jai
4da63e3ded ui: create node pool model (#17301)
Co-authored-by: Phil Renaud <phil@riotindustries.com>
Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2023-06-22 13:11:44 -04:00
Luiz Aoqui
085a0b0883 np: check for license on RPC endpoints (#17656) 2023-06-22 12:52:20 -04:00
Luiz Aoqui
717e1567bb ci: set continue-on-error: true on test-ui (#17646)
Since the matrix exercises different test cases, it's better to allow
all partitions to completely run, even if one of them fails, so it's
easier to catch multiple test failures.
2023-06-22 11:31:49 -04:00
Tim Gross
deae9bb62e client: send node secret with every client-to-server RPC (#16799)
In Nomad 1.5.3 we fixed a security bug that allowed bypass of ACL checks if the
request came thru a client node first. But this fix broke (knowingly) the
identification of many client-to-server RPCs. These will be now measured as if
they were anonymous. The reason for this is that many client-to-server RPCs do
not send the node secret and instead rely on the protection of mTLS.

This changeset ensures that the node secret is being sent with every
client-to-server RPC request. In a future version of Nomad we can add
enforcement on the server side, but this was left out of this changeset to
reduce risks to the safe upgrade path.

Sending the node secret as an auth token introduces a new problem during initial
introduction of a client. Clients send many RPCs concurrently with
`Node.Register`, but until the node is registered the node secret is unknown to
the server and will be rejected as invalid. This causes permission denied
errors.

To fix that, this changeset introduces a gate on having successfully made a
`Node.Register` RPC before any other RPCs can be sent (except for `Status.Ping`,
which we need earlier but which also ignores the error because that handler
doesn't do an authorization check). This ensures that we only send requests with
a node secret already known to the server. This also makes client startup a
little easier to reason about because we know `Node.Register` must succeed
first, and it should make for a good place to hook in future plans for secure
introduction of nodes. The tradeoff is that an existing client that has running
allocs will take slightly longer (a second or two) to transition to ready after
a restart, because the transition in `Node.UpdateStatus` is gated at the server
by first submitting `Node.UpdateAlloc` with client alloc updates.
2023-06-22 11:06:49 -04:00
Luiz Aoqui
28206f7210 ci: fix some flaky UI tests (#17648)
These tests would fail depending on the value of the seed used.
2023-06-22 10:51:07 -04:00
Tim Gross
b23fe72fb5 release pipeline: release workflow needs write permissions (#17669)
In #17103 we set read-only permissions on all the workflows. Unfortunately we
missed that the `release` workflow makes git commits and pushes them to the
repository, so it needs to have write permissions.
2023-06-22 10:40:45 -04:00
Luiz Aoqui
876117fadd ui: display mirage scenario in header label (#17649)
This information is useful when switching between different scenarios
for testing.
2023-06-22 10:38:17 -04:00
Seth Hoenig
33ac5ed1df client: do not disable memory swappiness if kernel does not support it (#17625)
* client: do not disable memory swappiness if kernel does not support it

This PR adds a workaround for very old Linux kernels which do not support
the memory swappiness interface file. Normally we write a "0" to the file
to explicitly disable swap. In the case the kernel does not support it,
give libcontainer a nil value so it does not write anything.

Fixes #17448

* client: detect swappiness by writing to the file

* fixup changelog

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>

---------

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
2023-06-22 09:36:31 -05:00
Luiz Aoqui
df37f2d022 ui: add tooltips to the Topology labels (#17647)
Add tooltips to labels in nodes and datacenters for the Topology view
page to clarify what each value represents.
2023-06-22 10:33:42 -04:00
Luiz Aoqui
a29048b3e7 ui: remove redundant columns from child job table (#17645)
Namespace, job type, and priority are already available from the parent
job header, so displaying them in the table caused it to be too crowded.
2023-06-22 10:22:41 -04:00
James Rasell
b197a9ee26 variables: remove unused state store functions. (#17660) 2023-06-22 13:54:58 +01:00
James Rasell
c19253215b core: use faster concatenation for alloc name generation. (#17591) 2023-06-22 07:46:28 +01:00
Luiz Aoqui
f4c7182873 node pools: apply node pool scheduler configuration (#17598) 2023-06-21 20:31:50 -04:00
Phil Renaud
fe49f22247 Moves to the current LTS release of Node for our build and release workflows (#17639) 2023-06-21 15:17:24 -04:00
Phil Renaud
873acf04b9 [ui] General status for steady-state jobs (#17599)
* Degraded vs Healthy etc. status

* Standardize the look of a deploying status panel

* badge styles

* remove job.status from title component in favour of in-panel status

* Remove a redundant check

* re-attrd fail-deployment button considered
2023-06-21 11:57:28 -04:00