---
layout: docs
page_title: Consul Service Mesh
description: >-
  Learn how to use Nomad with Consul service mesh to enable secure
  service-to-service communication.
---

# Consul Service Mesh

~> **Note:** Nomad's service mesh integration requires Linux network namespaces.
Consul service mesh will not run on Windows or macOS.

[Consul service mesh](/consul/docs/connect) provides
service-to-service connection authorization and encryption using mutual
Transport Layer Security (TLS). Applications can use sidecar proxies in a
service mesh configuration to automatically establish TLS connections for
inbound and outbound connections without being aware of the service mesh at all.

# Nomad with Consul Service Mesh Integration

Nomad integrates with Consul to provide secure service-to-service communication
between Nomad jobs and task groups. To support Consul service mesh, Nomad adds
a networking mode for jobs that enables tasks in the same task group to share
their networking stack. With a few changes to the job specification, job
authors can opt into service mesh integration. When service mesh is enabled,
Nomad launches an Envoy proxy alongside the application in the job file. The
proxy provides secure communication with other applications in the cluster.

Nomad job specification authors can use Nomad's Consul service mesh integration
to implement [service segmentation](https://www.consul.io/use-cases/multi-platform-service-mesh)
in a microservice architecture running in public clouds without having to
directly manage TLS certificates. This is transparent to job specification
authors, as the security features of the service mesh continue to work even as
the application scales up or down or gets rescheduled by Nomad.

To use the Consul service mesh integration with Consul ACLs enabled, see the
[Secure Nomad Jobs with Consul Service Mesh](/nomad/tutorials/integrate-consul/consul-service-mesh)
guide.

# Nomad Consul Service Mesh Example

The following section walks through an example that enables secure
communication between a web dashboard and a backend counting service, both
managed by Nomad. Nomad additionally configures Envoy proxies to run alongside
these applications. The dashboard is configured to connect to the counting
service via localhost on port 9001. The proxy is managed by Nomad and handles
mTLS communication to the counting service.

## Prerequisites

### Consul

~> **Note:** Nomad's Consul service mesh integration requires the `consul`
binary in your `$PATH`.

The Consul service mesh integration with Nomad requires [Consul 1.6 or
later](https://releases.hashicorp.com/consul/1.6.0/). The Consul agent can be
run in dev mode with the following command:

```shell-session
$ consul agent -dev
```

To use service mesh on a non-dev Consul agent, you will minimally need to
enable the gRPC port and set `connect` to enabled by adding some additional
information to your Consul client configurations, depending on format. Consul
agents running with TLS on version [1.14.0](https://releases.hashicorp.com/consul/1.14.0)
or later should set the `grpc_tls` configuration parameter instead of `grpc`.
See the Consul [port documentation](https://developer.hashicorp.com/consul/docs/install/ports)
for further reference.

For HCL configurations:

```hcl
# ...

ports {
  grpc = 8502
}

connect {
  enabled = true
}
```

For JSON configurations:

```javascript
{
  // ...
  "ports": {
    "grpc": 8502
  },
  "connect": {
    "enabled": true
  }
}
```
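For TLS-enabled agents on Consul 1.14 and later, the equivalent HCL sets the
`grpc_tls` port rather than `grpc`; a sketch, assuming the standard `grpc_tls`
port of 8503:

```hcl
# ...

ports {
  # TLS-enabled gRPC listener (Consul 1.14+)
  grpc_tls = 8503
}

connect {
  enabled = true
}
```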

#### Consul TLS

~> **Note:** Consul 1.14+ made a [backwards incompatible change][consul_grpc_tls]
in how TLS-enabled gRPC listeners work. When using Consul 1.14 with TLS enabled,
users will need to specify additional Nomad agent configuration to work with
Connect. The `consul.grpc_ca_file` value must now be configured (introduced in
Nomad 1.4.4), and `consul.grpc_address` will most likely need to be set to use
the new standard `grpc_tls` port of `8503`.

```hcl
consul {
  grpc_ca_file = "/etc/tls/consul-agent-ca.pem"
  grpc_address = "127.0.0.1:8503"
  ca_file      = "/etc/tls/consul-agent-ca.pem"
  cert_file    = "/etc/tls/dc1-client-consul-0.pem"
  key_file     = "/etc/tls/dc1-client-consul-0-key.pem"
  ssl          = true
  address      = "127.0.0.1:8501"
}
```

#### Consul ACLs

~> **Note:** Starting in Nomad v1.3.0, Consul Service Identity ACL tokens
automatically generated by Nomad on behalf of Connect-enabled services are
created in [`Local`] rather than Global scope, and are no longer replicated
globally.

To facilitate cross-datacenter requests of Connect services registered by
Nomad, Consul agents will need to be configured with a [default anonymous][anon_token]
ACL token whose ACL policy has sufficient permissions to read service and node
metadata pertaining to those requests. This mechanism is described in Consul
[#7414][consul_acl]. A typical Consul agent anonymous token may contain an ACL
policy such as:

```hcl
service_prefix "" { policy = "read" }
node_prefix    "" { policy = "read" }
```

#### Transparent Proxy

Nomad's support for [transparent proxy][] configures the task group's network
namespace so that traffic flows through the Envoy proxy. When the
[`transparent_proxy`][] block is enabled:

* Nomad will invoke the [`consul-cni`][] CNI plugin to configure `iptables`
  rules in the network namespace to force outbound traffic from an allocation
  to flow through the proxy.
* If the local Consul agent is serving DNS, Nomad will set the IP address of
  the Consul agent as the nameserver in the task's `/etc/resolv.conf`.
* Consul will provide a [virtual IP][] for any upstream service the workload
  has access to, based on the service intentions.

Using transparent proxy has several important requirements:

* You must have the [`consul-cni`][] CNI plugin installed on the client host
  along with the usual [required CNI plugins][cni_plugins].
* To use Consul DNS and virtual IPs, you will need to configure Consul's DNS
  listener to be exposed to the workload network namespace. You can do this
  without exposing the Consul agent on a public IP by setting the Consul
  `bind_addr` to bind on a private IP address (the default is to use the
  `client_addr`).
* The Consul agent must be configured with [`recursors`][] if you want
  allocations to make DNS queries for applications outside the service mesh.
* You cannot set a [`network.dns`][] block on the allocation (unless you set
  [`no_dns`][tproxy_no_dns], see below).
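
As a sketch of that last point, a group that needs its own [`network.dns`][]
block can opt the proxy out of Consul DNS with [`no_dns`][tproxy_no_dns]; the
nameserver address here is purely illustrative:

```hcl
group "api" {
  network {
    mode = "bridge"

    # A custom DNS configuration is only allowed when the
    # transparent proxy opts out of Consul DNS below.
    dns {
      servers = ["10.0.0.53"]
    }
  }

  service {
    name = "count-api"
    port = "9001"

    connect {
      sidecar_service {
        proxy {
          transparent_proxy {
            no_dns = true
          }
        }
      }
    }
  }
}
```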

For example, an HCL Consul agent configuration with a [go-sockaddr/template][]
binding to the subnet `10.37.105.0/20`, with recursive DNS set to OpenDNS
nameservers:

```hcl
bind_addr = "{{ GetPrivateInterfaces | include \"network\" \"10.37.105.0/20\" | limit 1 | attr \"address\" }}"

recursors = ["208.67.222.222", "208.67.220.220"]
```

### Nomad

Nomad must schedule onto a routable interface in order for the proxies to
connect to each other. The following steps show how to start a Nomad dev agent
configured for Consul service mesh:

```shell-session
$ sudo nomad agent -dev-connect
```

### CNI Plugins

Nomad uses CNI reference plugins to configure the network namespace used to
secure the Consul service mesh sidecar proxy. All Nomad client nodes using
network namespaces must have these CNI plugins [installed][cni_install].

To use [`transparent_proxy`][] mode, Nomad client nodes will also need the
[`consul-cni`][] plugin installed. See the Linux post-installation
[steps](/nomad/docs/install#post-installation-steps) for more detail on how to
install CNI plugins.
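
If the plugins are installed somewhere other than the default search path, you
can point the Nomad client at that directory; a sketch of the client
configuration, assuming the default `/opt/cni/bin` location:

```hcl
client {
  # Directory where Nomad looks for the CNI reference
  # plugins and the consul-cni plugin.
  cni_path = "/opt/cni/bin"
}
```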

## Run the Service Mesh-enabled Services

Once Nomad and Consul are running, with Consul DNS enabled for transparent
proxy mode as described above, submit the following service mesh-enabled
services to Nomad by copying the HCL into a file named `servicemesh.nomad.hcl`
and running `nomad job run servicemesh.nomad.hcl`.

```hcl
job "countdash" {
  datacenters = ["dc1"]

  group "api" {
    network {
      mode = "bridge"
    }

    service {
      name = "count-api"
      port = "9001"

      connect {
        sidecar_service {
          proxy {
            transparent_proxy {}
          }
        }
      }
    }

    task "web" {
      driver = "docker"

      config {
        image = "hashicorpdev/counter-api:v3"
      }
    }
  }

  group "dashboard" {
    network {
      mode = "bridge"

      port "http" {
        static = 9002
        to     = 9002
      }
    }

    service {
      name = "count-dashboard"
      port = "http"

      connect {
        sidecar_service {
          proxy {
            transparent_proxy {}
          }
        }
      }
    }

    task "dashboard" {
      driver = "docker"

      env {
        COUNTING_SERVICE_URL = "http://count-api.virtual.consul"
      }

      config {
        image = "hashicorpdev/counter-dashboard:v3"
      }
    }
  }
}
```

The job contains two task groups: an API service and a web frontend.

### API Service

The API service is defined as a task group with a bridge network:

```hcl
group "api" {
  network {
    mode = "bridge"
  }

  # ...
}
```

Since the API service is only accessible via the Consul service mesh, it does
not define any ports in its network. The `connect` block enables the service
mesh and the `transparent_proxy` block ensures that the service will be
reachable via a virtual IP address when used with Consul DNS.

```hcl
group "api" {

  # ...

  service {
    name = "count-api"
    port = "9001"

    connect {
      sidecar_service {
        proxy {
          transparent_proxy {}
        }
      }
    }
  }

  # ...

}
```

The `port` in the service block is the port the API service listens on. The
Envoy proxy will automatically route traffic to that port inside the network
namespace. Note that this currently cannot be a named port; it must be a
hard-coded port value. See [GH-9907].

### Web Frontend

The web frontend is defined as a task group with a bridge network and a static
forwarded port:

```hcl
group "dashboard" {
  network {
    mode = "bridge"

    port "http" {
      static = 9002
      to     = 9002
    }
  }

  # ...

}
```

The `static = 9002` parameter requests that the Nomad scheduler reserve port
9002 on a host network interface. The `to = 9002` parameter forwards that host
port to port 9002 inside the network namespace.

This allows you to connect to the web frontend in a browser by visiting
`http://<host_ip>:9002`, as shown below:

[![Count Dashboard][count-dashboard]][count-dashboard]

The web frontend connects to the API service via Consul service mesh.

```hcl
service {
  name = "count-dashboard"
  port = "http"

  connect {
    sidecar_service {
      proxy {
        transparent_proxy {}
      }
    }
  }
}
```

The `connect` block with `transparent_proxy` configures the web frontend's
network namespace to route all access to the `count-api` service through the
Envoy proxy.

The web frontend is configured to communicate with the API service via the
`$COUNTING_SERVICE_URL` environment variable:

```hcl
env {
  COUNTING_SERVICE_URL = "http://count-api.virtual.consul"
}
```

The `transparent_proxy` block ensures that DNS queries are made to Consul so
that the `count-api.virtual.consul` name resolves to a virtual IP address. Note
that you don't need to specify a port number, because connections to the
virtual IP are always directed to the correct service port.

### Manually Configured Upstreams

You can also use Connect without Consul DNS and `transparent_proxy` mode. This
approach is not recommended because it requires duplicating service intention
information in an `upstreams` block in the Nomad job specification. However,
because Consul DNS is not protected by ACLs, you might want to do this if you
don't want to expose Consul DNS to untrusted workloads.

In that case, you can add `upstream` blocks to the job spec. You don't need the
`transparent_proxy` block for the `count-api` service:

```hcl
group "api" {

  # ...

  service {
    name = "count-api"
    port = "9001"

    connect {
      sidecar_service {}
    }
  }

  # ...

}
```

But you'll need to add an `upstreams` block to the `count-dashboard` service:

```hcl
service {
  name = "count-dashboard"
  port = "http"

  connect {
    sidecar_service {
      proxy {
        upstreams {
          destination_name = "count-api"
          local_bind_port  = 8080
        }
      }
    }
  }
}
```

The `upstreams` block defines the remote service to access (`count-api`) and
the port to expose that service on inside the network namespace (`8080`).

The web frontend will also need to use an environment variable to communicate
with the API service:

```hcl
env {
  COUNTING_SERVICE_URL = "http://${NOMAD_UPSTREAM_ADDR_count_api}"
}
```

This environment variable value gets interpolated with the upstream's address.
Note that dashes (`-`) are converted to underscores (`_`) in environment
variable names, so `count-api` becomes `count_api`.

## Limitations

- The minimum Consul version to use Connect with Nomad is Consul v1.8.0.
- The `consul` binary must be present in Nomad's `$PATH` to run the Envoy
  proxy sidecar on client nodes.
- Consul service mesh using network namespaces is only supported on Linux.
- Prior to Consul 1.9, the Envoy sidecar proxy will drop and stop accepting
  connections while the Nomad agent is restarting.

## Troubleshooting

If the sidecar service is not running correctly, you can investigate potential
`envoy` failures in the following ways:

* Task logs in the associated `connect-*` task
* Task secrets (may contain sensitive information):
  * Envoy CLI command: `secrets/.envoy_bootstrap.cmd`
  * environment variables: `secrets/.envoy_bootstrap.env`
* An extra allocation log file: `alloc/logs/envoy_bootstrap.stderr.0`

For example, with an allocation ID starting with `b36a`:

```shell-session
$ nomad alloc status -short b36a  # to get the connect-* task name
$ nomad alloc logs -task connect-proxy-count-api -stderr b36a
$ nomad alloc exec -task connect-proxy-count-api b36a cat secrets/.envoy_bootstrap.cmd
$ nomad alloc exec -task connect-proxy-count-api b36a cat secrets/.envoy_bootstrap.env
$ nomad alloc fs b36a alloc/logs/envoy_bootstrap.stderr.0
```

Note: If the allocation is unable to start successfully, debugging files may
only be accessible from the host filesystem. However, the sidecar task's
secrets directory may not be available on systems where it is mounted in a
temporary filesystem.

Bootstrapping the Envoy proxy requires that the Consul ACL token and service
registration have successfully replicated to whichever Consul server the local
Consul agent is connected to. Nomad clients poll for these objects with
exponential backoff and a timeout. You can adjust the timeouts on a given node
by setting node metadata values via the command line or in the [`client.meta`][]
agent configuration block. The default values are shown below:

```shell-session
$ nomad node meta apply -node-id $nodeID \
    consul.token_preflight_check.timeout=10s \
    consul.token_preflight_check.base=500ms \
    consul.service_preflight_check.timeout=60s \
    consul.service_preflight_check.base=1s
```
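
The same defaults can instead be set in the agent configuration file; a sketch
of the [`client.meta`][] block:

```hcl
client {
  meta {
    # Preflight timeouts and backoff bases for the Consul
    # ACL token and service registration checks.
    "consul.token_preflight_check.timeout"   = "10s"
    "consul.token_preflight_check.base"      = "500ms"
    "consul.service_preflight_check.timeout" = "60s"
    "consul.service_preflight_check.base"    = "1s"
  }
}
```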

[count-dashboard]: /img/count-dashboard.png
[consul_acl]: https://github.com/hashicorp/consul/issues/7414
[gh-9907]: https://github.com/hashicorp/nomad/issues/9907
[`Local`]: /consul/docs/security/acl/acl-tokens#token-attributes
[anon_token]: /consul/docs/security/acl/acl-tokens#special-purpose-tokens
[consul_ports]: /consul/docs/agent/config/config-files#ports
[consul_grpc_tls]: /consul/docs/upgrading/upgrade-specific#changes-to-grpc-tls-configuration
[cni_install]: /nomad/docs/install#post-installation-steps
[transparent proxy]: /consul/docs/k8s/connect/transparent-proxy
[go-sockaddr/template]: https://pkg.go.dev/github.com/hashicorp/go-sockaddr/template
[`recursors`]: /consul/docs/agent/config/config-files#recursors
[`transparent_proxy`]: /nomad/docs/job-specification/transparent_proxy
[tproxy_no_dns]: /nomad/docs/job-specification/transparent_proxy#no_dns
[`consul-cni`]: https://releases.hashicorp.com/consul-cni
[virtual IP]: /consul/docs/services/discovery/dns-static-lookups#service-virtual-ip-lookups
[cni_plugins]: /nomad/docs/networking/cni#cni-reference-plugins
[consul_dns_port]: /consul/docs/agent/config/config-files#dns_port
[`network.dns`]: /nomad/docs/job-specification/network#dns-parameters
[`client.meta`]: /nomad/docs/configuration/client#meta