---
layout: docs
page_title: Operate a Nomad agent
description: |-
  The Nomad agent is a long-running process that runs in either client or
  server mode.
---

# Operate a Nomad agent

A Nomad agent is a long-running process that runs on every machine in your Nomad
cluster. The behavior of the agent depends on whether it runs in client or
server mode. Clients run tasks, while servers manage the cluster.

Server agents are part of the [consensus protocol](/nomad/docs/architecture/cluster/consensus) and
[gossip protocol](/nomad/docs/architecture/security/gossip). The consensus protocol, powered
by Raft, lets the servers perform leader election and state replication.
The gossip protocol allows for server clustering and multi-region federation.
Because servers are more resource intensive than clients, run them on
dedicated instances.

Client agents use fingerprinting to determine the capabilities and resources of
the host machine, as well as what [drivers](/nomad/docs/job-declare/task-driver) are available.
Clients register with servers to provide node information and a heartbeat.
Clients run tasks that the server assigns to them. Client nodes make up the
majority of the cluster and are very lightweight. They interface with the server
nodes and maintain very little state of their own. A cluster usually has three or
five server agents and potentially thousands of clients.
## Run an agent
Start the agent with the [`nomad agent` command](/nomad/commands/agent).
This command blocks, running forever or until told to quit. The `nomad agent`
command takes a variety of configuration options, but most have sane defaults.

<Note title="Linux Users">

You must run client agents as root, or with `sudo`, so that cpuset accounting
and network namespaces work correctly.

</Note>

This example starts the agent in development mode, which means the agent runs
as both the server and the client. Do not use `-dev` in a production environment.

```shell-session
$ sudo nomad agent -dev
==> Starting Nomad agent...
==> Nomad agent configuration:
Client: true
Log Level: INFO
Region: global (DC: dc1)
Server: true
==> Nomad agent started! Log data will stream in below:
[INFO] serf: EventMemberJoin: server-1.node.global 127.0.0.1
[INFO] nomad: starting 4 scheduling worker(s) for [service batch _core]
...
```
The `nomad agent` command outputs the following important information:
- **Client**: This indicates whether the agent is running as a client.
Client nodes fingerprint their host environment, register with servers,
and run tasks.
- **Log Level**: This indicates the configured log level. Nomad logs only
messages with an equal or higher severity. You can change the log level to
increase verbosity for debugging or lower it to reduce noisy logging.
- **Region**: This is the region and datacenter in which the agent runs. Nomad
has first-class support for multi-datacenter and multi-region configurations.
Use the `-region` and `-dc` flags to set the region and datacenter, as shown
in the example after this list. The default is the `global` region in `dc1`.
- **Server**: This indicates whether the agent is running as a server. Server
nodes have the extra burden of participating in the consensus protocol,
storing cluster state, and making scheduling decisions.
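
For example, the following starts an agent with an explicit region and
datacenter. The configuration path, region, and datacenter names are
placeholder values to adapt to your environment:

```shell-session
$ sudo nomad agent -config /etc/nomad.d -region europe -dc eu-west-1
```
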
## Stop an agent
By default, any stop signal, such as interrupt or terminate, causes the
agent to exit after ensuring its internal state is written to disk as
needed. You can configure additional shutdown behavior by setting
[`leave_on_interrupt`][] or [`leave_on_terminate`][] to respond to the
respective signals.

For servers, when you set `leave_on_interrupt` or `leave_on_terminate`, the
servers notify other servers of their intention to leave the cluster, which
allows them to leave the [consensus][] peer set. It is especially important that
a server node be allowed to leave gracefully so that there is a minimal
impact on availability as the server leaves the consensus peer set. If a server
does not leave gracefully and will not return to service, use the
[`server force-leave` command][] to eject that server from the consensus peer set.
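
For example, to eject a dead server that will not return, pass its name as it
appears in `nomad server members` output. The node name below is a placeholder:

```shell-session
$ nomad server force-leave nomad-server-03.global
```
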
For clients, when you set `leave_on_interrupt` or `leave_on_terminate` and the
client is configured with [`drain_on_shutdown`][], the client drains its
workloads before shutting down.
## Signal handling
In addition to the optional handling of the interrupt (`SIGINT`) and terminate
(`SIGTERM`) signals described in the [Stop an agent
section](#stop-an-agent), Nomad supports special behavior for several other
signals useful for debugging.
* `SIGHUP` causes Nomad to [reload its configuration][].
* `SIGUSR1` causes Nomad to print its [metrics][] without stopping the
agent.
* `SIGQUIT`, `SIGILL`, `SIGTRAP`, `SIGABRT`, `SIGSTKFLT`, `SIGEMT`, or `SIGSYS`
signals are handled by the Go runtime. These signals cause the Nomad agent to
exit and print its stack trace.

When using the official HashiCorp packages on Linux, you can send these signals
via `systemctl`.
This example outputs the Nomad agent's metrics.
```shell-session
$ sudo systemctl kill nomad -s SIGUSR1
```
You can then read those metrics in the service logs:
```shell-session
$ journalctl -u nomad
```
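
Similarly, this example reloads the agent's configuration without restarting
the agent:

```shell-session
$ sudo systemctl kill nomad -s SIGHUP
```
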
## Lifecycle
Every agent in the Nomad cluster goes through a lifecycle. Understanding
this lifecycle is useful for building a mental model of an agent's interactions
with a cluster and how the cluster treats a node.
When a client agent starts, it fingerprints the host machine to identify its
attributes, capabilities, and [task drivers](/nomad/docs/job-declare/task-driver). The client
then reports this information to the servers during an initial registration. You
provide the addresses of known servers to the agent via configuration,
potentially using DNS for resolution. Use [Consul](https://www.consul.io/)
to avoid hard coding addresses and instead resolve them on demand.
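
After a client registers, you can review the attributes and drivers it
fingerprinted. For example, run the following on the client node itself:

```shell-session
$ nomad node status -self -verbose
```
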
While a client is running, it sends heartbeats to servers to maintain liveness.
If the heartbeats fail, the servers assume the client node has failed. The
servers then stop assigning new tasks to the node and migrate its existing tasks. It is
impossible to distinguish between a network failure and an agent crash, so Nomad
handles both
cases in the same way. Once the network recovers or a crashed agent
restarts, Nomad updates the node status and resumes normal operation.

To prevent an accumulation of nodes in a terminal state, Nomad periodically
garbage collects nodes. By default, if a node is in a failed or `down`
state for over 24 hours, Nomad garbage collects that node.
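
If you do not want to wait for the periodic collection, you can trigger a
garbage collection cycle manually, assuming you have sufficient privileges:

```shell-session
$ nomad system gc
```
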
Servers are slightly more complex since they perform additional functions. They
participate in a [gossip protocol](/nomad/docs/architecture/security/gossip) both to cluster
within a region and to support multi-region configurations. When a server
starts, it does not know the addresses of the other servers in the cluster.
To discover its peers, it must join the cluster. You do this with the
[`server join` command](/nomad/commands/server/join) or by providing the
proper configuration on start. Once a node joins, this information is gossiped
to the entire cluster, meaning all nodes will eventually be aware of each other.
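
As a sketch, the address below is a placeholder for one of your existing
servers, using the default gossip port. After joining, list the members to
verify the peer set:

```shell-session
$ nomad server join 10.0.0.10:4648
$ nomad server members
```
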
When a server leaves, it specifies its intent to do so, and the cluster marks
that node as having left. If the server has left, replication to it
stops, and it is removed from the consensus peer set. If the server has failed,
replication attempts to make progress to recover from a software or network
failure.
## Permissions
Nomad servers and Nomad clients have different permission requirements.

Run Nomad servers with the lowest possible permissions. They need access only
to their own data directory and the ability to bind to their ports. Create a
`nomad` user with the minimal set of required privileges.

Run Nomad clients as `root` because the OS isolation mechanisms they use
require root privileges. While it is possible to run Nomad as an unprivileged
user, you must test carefully to ensure the task drivers and features you use
function as expected. The Nomad client's data directory should be owned by
`root` with filesystem permissions set to `0700`.
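
As a minimal sketch, assuming a `nomad` system user for the servers and a
client data directory of `/opt/nomad/data` (adjust the paths to match your own
configuration):

```shell-session
$ # On server hosts: create a minimally privileged user for the Nomad server.
$ sudo useradd --system --home /etc/nomad.d --shell /bin/false nomad

$ # On client hosts: restrict the client's data directory to root.
$ sudo chown root:root /opt/nomad/data
$ sudo chmod 0700 /opt/nomad/data
```
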
[`leave_on_interrupt`]: /nomad/docs/configuration#leave_on_interrupt
[`leave_on_terminate`]: /nomad/docs/configuration#leave_on_terminate
[`server force-leave` command]: /nomad/commands/server/force-leave
[consensus]: /nomad/docs/architecture/cluster/consensus
[`drain_on_shutdown`]: /nomad/docs/configuration/client#drain_on_shutdown
[reload its configuration]: /nomad/docs/configuration#configuration-reload
[metrics]: /nomad/docs/reference/metrics