fix for:
> This is a scheduled Windows Server 2019 brownout.
> The Windows Server 2019 image will be removed on 2025-06-30.
> For more details, see actions/runner-images#12045
This change creates a reusable workflow for notifying Slack on CI
failures. The message will include useful links and information
about the failure, so product engineers can investigate and fix
any problems.
The new workflow is used by selected workflows which trigger on
merges to main or release/* branches. The notification is only
sent on failure and when the event was a push (PR merge) meaning
the number of notifications should be minimal.
The aim is to help identify and draw attention to failure across
our release branches, in particular when automated processes
happen.
On Windows, if the `raw_exec` driver's executor exits, the child processes are
not also killed. Create a Windows "job object" (not to be confused with a Nomad
job) and add the executor to it. Child processes of the executor will inherit
the job automatically. When the handle to the job object is freed (on executor
exit), the job itself is destroyed and this causes all processes in that job to
exit.
Fixes: https://github.com/hashicorp/nomad/issues/23668
Ref: https://learn.microsoft.com/en-us/windows/win32/procthread/job-objects
When using the Client FS APIs, we check to ensure that reads don't traverse into
the allocation's secret dir and private dir. But this check can be bypassed on
case-insensitive file systems (ex. Windows, macOS, and Linux with obscure ext4
options enabled). This allows a user with `read-fs` permissions but not
`alloc-exec` permissions to read from the secrets dir.
This changeset updates the check so that it's case-insensitive. This risks false
positives for escape (see linked Go issue), but only if a task without
filesystem isolation deliberately writes into the task working directory to do
so, which is a fail-safe failure mode.
Ref: https://github.com/golang/go/issues/18358
Co-authored-by: dduzgun-security <deniz.duzgun@hashicorp.com>
After changes introduced in #23284 we no longer need to make a if
!st.SupportsNUMA() check in the GetNodes() topology method. In fact this check
will now cause panic in nomadTopologyToProto method on systems that don't
support NUMA.
The Nomad client renders templates in the same privileged process used for most
other client operations. During internal testing, we discovered that a malicious
task can create a symlink that can cause template rendering to read and write to
arbitrary files outside the allocation sandbox. Because the Nomad agent can be
restarted without restarting tasks, we can't simply check that the path is safe
at the time we write without encountering a time-of-check/time-of-use race.
To protect Nomad client hosts from this attack, we'll now read and write
templates in a subprocess:
* On Linux/Unix, this subprocess is sandboxed via chroot to the allocation
directory. This requires that Nomad is running as a privileged process. A
non-root Nomad agent will warn that it cannot sandbox the template renderer.
* On Windows, this process is sandboxed via a Windows AppContainer which has
been granted access to only to the allocation directory. This does not require
special privileges on Windows. (Creating symlinks in the first place can be
prevented by running workloads as non-Administrator or
non-ContainerAdministrator users.)
Both sandboxes cause encountered symlinks to be evaluated in the context of the
sandbox, which will result in a "file not found" or "access denied" error,
depending on the platform. This change will also require an update to
Consul-Template to allow callers to inject a custom `ReaderFunc` and
`RenderFunc`.
This design is intended as a workaround to allow us to fix this bug without
creating backwards compatibility issues for running tasks. A future version of
Nomad may introduce a read-only mount specifically for templates and artifacts
so that tasks cannot write into the same location that the Nomad agent is.
Fixes: https://github.com/hashicorp/nomad/issues/19888
Fixes: CVE-2024-1329
We don't run the whole suite of unit tests on all platforms to keep CI times
reasonable, so the only things we've been running on Windows are
platform-specific.
I'm working on some platform-specific `template` related work and having these
tests run on Windows will reduce the risk of regressions. Our Windows CI box
doesn't have Consul or Vault, so I've skipped those tests for the time being,
and can follow up with that later. There's also a test with assertions looking
for specific paths, and the results are different on Windows. I've skipped those
for the moment as well and will follow up under a separate PR.
Also swap `testify` for `shoenig/test`
Trusted Supply Chain Component Registry (TSCCR) enforcement starts Monday and an
internal report shows our semgrep action is pinned to a version that's not
currently permitted. Update all the action versions to whatever's the new
hotness to maximum the time-to-live on these until we have automated pinning
setup.
Also version bumps our chromedriver action, which randomly broke upstream today.
so in enterprise we can use Vault for secrets,
without merge conflicts from oss->ent.
also:
* use hashicorp/setup-golang
* setup-js for self-hosted runners
they don't come with yarn, nor chrome,
and might not always match node version.
namely, these workflows:
test-e2e, test-ui, and test-windows
extra-curricularly, as part of the overall
migration effort company-wide, this also includes
some standardization such as:
* explicit permissions:read on various workflows
* pinned action version shas (per https://github.com/hashicorp/security-public-tsccr)
* actionlint, which among other things runs
shellcheck on GHA run steps
Co-authored-by: emilymianeil <eneil@hashicorp.com>
Co-authored-by: Daniel Kimsey <daniel.kimsey@hashicorp.com>