---
layout: docs
page_title: Garbage collection
description: |-
  Nomad garbage collects Access Control List (ACL) tokens, allocations, deployments, encryption root keys, evaluations, jobs, nodes, plugins, and Container Storage Interface (CSI) volumes. Learn about server-side and client-side garbage collection processes, including configuration and triggers.
---

# Garbage collection

Nomad garbage collection is not the same as garbage collection in a programming
language, but the motivation behind its design is similar: garbage collection
frees up memory allocated for objects that the scheduler no longer references or
needs. Nomad only garbage collects objects that are in a terminal state and only
after a delay to allow inspection or debugging.

Nomad runs garbage collection processes on servers and on client nodes. You may
also manually trigger garbage collection on the server.

Nomad garbage collects the following objects:

- [ACL token](#configuration)
- [Allocation](#client-side-garbage-collection)
- [CSI Plugin](#configuration)
- [Deployment](#configuration)
- [Encryption root key](#configuration)
- [Evaluation](#configuration)
- [Job](#configuration)
- [Node](#configuration)
- [Volume](#configuration)

## Cascading garbage collection

Nomad's scheduled garbage collection processes generally handle each resource
type independently. However, there is an implicit cascading relationship because
of how objects reference each other. In practice, when Nomad garbage collects a
higher-level object, Nomad also removes the object's associated sub-objects to
prevent orphaned objects.

For example, garbage collecting a job also causes Nomad to drop all of that
job's remaining evaluations, deployments, and allocation records from the state.
Nomad garbage collects those objects either as part of the job garbage
collection process or by each object's own garbage collection process running
immediately after. Nomad's scheduled garbage collection processes only garbage
collect objects after they have been terminal for at least the specified time
threshold and are no longer needed for future scheduling decisions. Note that
when you force garbage collection by running the `nomad system gc` command,
Nomad ignores the specified time threshold.

## Server-side garbage collection

The Nomad server leader starts periodic garbage collection processes that clean
objects marked for garbage collection from memory. Nomad automatically marks
some objects, like evaluations, for garbage collection. Alternatively, you may
manually mark jobs for garbage collection by running `nomad system gc`, which
runs the garbage collection process.

### Configuration

These settings govern garbage collection behavior on the server nodes. You may
review the intervals in the [`config.go`
file](https://github.com/hashicorp/nomad/blob/b11619010e1c83488e14e2785569e515b2769062/nomad/config.go#L564)
for objects without a configurable interval setting.

| Object | Interval | Threshold |
|---|---|---|
| **ACL token** | 5 minutes | [`acl_token_gc_threshold`](/nomad/docs/configuration/server#acl_token_gc_threshold)<br/>Default: 1 hour |
| **CSI Plugin** | 5 minutes | [`csi_plugin_gc_threshold`](/nomad/docs/configuration/server#csi_plugin_gc_threshold)<br/>Default: 1 hour |
| **Deployment** | 5 minutes | [`deployment_gc_threshold`](/nomad/docs/configuration/server#deployment_gc_threshold)<br/>Default: 1 hour |
| **Encryption root key** | [`root_key_gc_interval`](/nomad/docs/configuration/server#root_key_gc_interval)<br/>Default: 10 minutes | [`root_key_gc_threshold`](/nomad/docs/configuration/server#root_key_gc_threshold)<br/>Default: 1 hour |
| **Evaluation** | 5 minutes | [`eval_gc_threshold`](/nomad/docs/configuration/server#eval_gc_threshold)<br/>Default: 1 hour |
| **Evaluation, batch** | 5 minutes | [`batch_eval_gc_threshold`](/nomad/docs/configuration/server#batch_eval_gc_threshold)<br/>Default: 24 hours |
| **Job** | [`job_gc_interval`](/nomad/docs/configuration/server#job_gc_interval)<br/>Default: 5 minutes | [`job_gc_threshold`](/nomad/docs/configuration/server#job_gc_threshold)<br/>Default: 4 hours |
| **Node** | 5 minutes | [`node_gc_threshold`](/nomad/docs/configuration/server#node_gc_threshold)<br/>Default: 24 hours |
| **Volume** | [`csi_volume_claim_gc_interval`](/nomad/docs/configuration/server#csi_volume_claim_gc_interval)<br/>Default: 5 minutes | [`csi_volume_claim_gc_threshold`](/nomad/docs/configuration/server#csi_volume_claim_gc_threshold)<br/>Default: 1 hour |

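You set these parameters in the agent's `server` block, with thresholds and intervals expressed as duration strings. The following is a minimal sketch of overriding a few of the defaults from the table above; the values shown are illustrative, not recommendations:

```hcl
server {
  enabled = true

  # Keep terminal jobs for 24 hours instead of the default 4 hours.
  job_gc_threshold = "24h"

  # Scan for collectible jobs every 15 minutes instead of every 5 minutes.
  job_gc_interval = "15m"

  # Purge expired ACL tokens after 30 minutes instead of the default 1 hour.
  acl_token_gc_threshold = "30m"
}
```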
### Triggers

The server garbage collection processes wake up at configured intervals to scan
for any expired or terminal objects to permanently delete, provided the object's
time in a terminal state exceeds its garbage collection threshold. For example,
a job's default garbage collection threshold is four hours, so the job must be
in a terminal state for at least four hours before the garbage collection
process permanently deletes the job and its dependent objects.

When you force garbage collection by manually running the `nomad system gc`
command, you are telling the garbage collection process to ignore thresholds and
immediately purge all terminal objects on all servers and clients.

## Client-side garbage collection

On each client node, Nomad must clean up resources from terminated allocations
to free disk and memory on the machine.

### Configuration

These settings govern allocation garbage collection behavior on each client node.

| Parameter | Default | Description |
| -------- | ------- | ------------- |
| [`gc_interval`](/nomad/docs/configuration/client#gc_interval) | 1 minute | Interval at which Nomad attempts to garbage collect terminal allocation directories |
| [`gc_disk_usage_threshold`](/nomad/docs/configuration/client#gc_disk_usage_threshold) | 80 | Disk usage percent which Nomad tries to maintain by garbage collecting terminal allocations |
| [`gc_inode_usage_threshold`](/nomad/docs/configuration/client#gc_inode_usage_threshold) | 70 | Inode usage percent which Nomad tries to maintain by garbage collecting terminal allocations |
| [`gc_max_allocs`](/nomad/docs/configuration/client#gc_max_allocs) | 50 | Maximum number of allocations which a client will track before triggering a garbage collection of terminal allocations |
| [`gc_parallel_destroys`](/nomad/docs/configuration/client#gc_parallel_destroys) | 2 | Maximum number of parallel destroys allowed by the garbage collector |

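You set these parameters in the agent's `client` block. The following minimal sketch tightens the disk threshold and raises the allocation limit; the values are illustrative, not recommendations:

```hcl
client {
  enabled = true

  # Check for collectible allocations every five minutes instead of every minute.
  gc_interval = "5m"

  # Start evicting terminal allocations once disk usage reaches 70 percent.
  gc_disk_usage_threshold = 70

  # Track at most 100 allocations before collecting terminal ones.
  gc_max_allocs = 100

  # Destroy up to four allocation directories in parallel.
  gc_parallel_destroys = 4
}
```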

Refer to the [client block in agent configuration
reference](/nomad/docs/configuration/client) for complete parameter descriptions
and examples.

Note that there is no time-based retention setting for allocations. Unlike jobs
or evaluations, you cannot specify how long to retain terminal allocations
before garbage collection. As soon as an allocation is terminal, it becomes
eligible for cleanup if the configured thresholds demand it.

### Triggers

Nomad's client runs allocation garbage collection based on these triggers:

- Scheduled interval

  The garbage collection process launches a ticker based on the configured
  `gc_interval`. On each tick, the garbage collection process checks to see if
  it needs to remove terminal allocations.

- Terminal state

  When an allocation transitions to a terminal state, Nomad marks the allocation
  for garbage collection and then signals the garbage collection process to run
  immediately.

- Allocation placement

  Nomad may preemptively run garbage collection to make room for new
  allocations. The client garbage collects older, terminal allocations if adding
  new allocations would exceed the `gc_max_allocs` limit.

- Forced garbage collection

  When you force garbage collection by running the `nomad system gc` command,
  the garbage collection process removes all terminal objects on all servers and
  clients, ignoring thresholds.

Nomad does not continuously monitor disk or inode usage to trigger garbage
collection. Instead, Nomad only checks disk and inode thresholds when one of the
aforementioned triggers invokes the garbage collection process. The
`gc_inode_usage_threshold` and `gc_disk_usage_threshold` values do not trigger
garbage collection; rather, those values influence how the garbage collector
behaves during a collection run.

### Allocation selection

When the garbage collection process runs, Nomad destroys as many finished
allocations as needed to meet the resource thresholds. The client maintains a
priority queue of terminal allocations ordered by the time they were marked
finished, oldest first.

The process repeatedly evicts allocations from the queue until the conditions
are back within configured limits. Specifically, the garbage collection loop
checks, in order:

1. If disk usage exceeds the `gc_disk_usage_threshold` value
1. If inode usage exceeds the `gc_inode_usage_threshold` value
1. If the count of allocations exceeds the `gc_max_allocs` value

If any one of these conditions is true, the garbage collector selects the oldest
finished allocation for removal.

After deleting one allocation, the loop re-checks the metrics and continues
removing the next-oldest allocation until all thresholds are satisfied or until
there are no more terminal allocations. This means that in a single run, the
garbage collector may remove multiple allocations back-to-back if the node is
far over the limits. The evictions happen in termination-time order, oldest
completed allocations first.

If the node's usage and allocation count are under the limits, a normal garbage
collection cycle does not remove any allocations. In other words, periodic and
event-driven garbage collection does not delete allocations just because they
are finished; there has to be resource pressure or a limit reached. The
exception is when an administrative command or server-side removal triggers
client-side garbage collection. Aside from that forced scenario, the default
behavior is threshold-driven: Nomad leaves allocations on disk until it needs to
reclaim them because a space, inode, or count limit has been hit.

### Task driver resources garbage collection

Most task drivers do not have their own garbage collection process. When an
allocation is terminal, the client garbage collection process communicates with
the task driver to ensure the task's resources have been cleaned up. Note that
the Docker task driver periodically cleans up its own resources. Refer to the
[Docker task driver plugin
options](/nomad/docs/deploy/task-driver/docker#gc) for details.

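As a rough sketch, the Docker driver's cleanup behavior is tuned through a `gc` block in its plugin configuration. The option names shown here (`image`, `image_delay`, `container`) come from the Docker driver plugin documentation rather than from this page, so confirm them on the linked options page before relying on this snippet:

```hcl
plugin "docker" {
  config {
    # Option names follow the Docker driver plugin docs; verify before use.
    gc {
      # Remove Docker images once no Nomad task references them.
      image       = true

      # Wait one hour after last use before deleting an unused image.
      image_delay = "1h"

      # Remove containers once their tasks exit.
      container   = true
    }
  }
}
```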

When a task has configured restart attempts and the task fails, the Nomad client
attempts an in-place task restart within the same allocation. The task driver
starts a new process or container for the task. If the task continues to fail
and exceeds the configured restart attempts, Nomad terminates the task and marks
the allocation as terminal. The task driver then cleans up its resources, such
as a Docker container or cgroups. When the garbage collection process runs, it
makes sure that the task driver cleanup is done before deleting the allocation.
If a task driver fails to clean up properly, Nomad logs errors but continues the
garbage collection process. Task driver cleanup failures can delay when the
allocation's resources are truly freed. For instance, if volumes are not
detached, Nomad might not fully reclaim disk space until you fix the issue.

## Resources

- [Nomad's internal garbage collection and optimization discovery during the Nomad Bench project blog post](https://www.hashicorp.com/en/blog/nomad-garbage-collection-optimization-discovery-during-nomad-bench)
- Configuration
  - [client Block in Agent Configuration](/nomad/docs/configuration/client)
  - [server Block in Agent Configuration](/nomad/docs/configuration/server)
- [`nomad system gc` command reference](/nomad/commands/system/gc)
- [System HTTP API Force GC](/nomad/api-docs/system#force-gc)