Merge pull request #1373 from hashicorp/d-operating-job

Documentation for operating a job
This commit is contained in:
Alex Dadgar
2016-07-01 11:18:58 -07:00
committed by GitHub
11 changed files with 685 additions and 4 deletions

View File

@@ -180,6 +180,8 @@ nodes, unless otherwise specified:
automatically bootstrap itself using Consul. For more details see the [`consul`
section](#consul_options).
<a id="telemetry_config"></a>
* `telemetry`: Used to control how the Nomad agent exposes telemetry data to
external metrics collection servers. This is a key/value mapping and supports
the following keys:

View File

@@ -22,6 +22,9 @@ getting a better view of what Nomad is doing.
Telemetry information can be streamed to both [statsite](https://github.com/armon/statsite)
and statsd, depending on the configuration options provided. To configure the
telemetry output, please see the [agent
configuration](/docs/agent/config.html#telemetry_config).
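As a sketch, a `telemetry` block in the agent configuration might look like the
following (the addresses are illustrative):
```
telemetry {
  statsite_address = "statsite.example.com:8125"
  statsd_address   = "statsd.example.com:8125"
}
```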
Below is sample output of a telemetry dump:
```text

View File

@@ -0,0 +1,15 @@
---
layout: "docs"
page_title: "Operating a Job"
sidebar_current: "docs-jobops"
description: |-
Learn how to operate a Nomad Job.
---
# Operating a Job
Once a job has been submitted to Nomad, users must be able to inspect the state
of tasks, understand resource usage and access task logs. Further, for services,
performing zero downtime updates is critical. This section provides some best
practices and guidance for operating jobs under Nomad. Please navigate to the
appropriate sub-sections for more information.

View File

@@ -0,0 +1,171 @@
---
layout: "docs"
page_title: "Operating a Job: Inspecting State"
sidebar_current: "docs-jobops-inspection"
description: |-
Learn how to inspect a Nomad Job.
---
# Inspecting State
Once a job is submitted, the next step is to ensure it is running. This section
will assume we have submitted a job with the name _example_.
To get a high-level overview of our job, we can use the [`nomad status`
command](/docs/commands/status.html). This command will display the list of
running allocations, as well as any recent placement failures. An example below
shows that the job has some allocations placed but did not have enough resources
to place all of the desired allocations. We run with `-evals` to see that there
is an outstanding evaluation for the job:
```
$ nomad status example
ID = example
Name = example
Type = service
Priority = 50
Datacenters = dc1
Status = running
Periodic = false
Evaluations
ID Priority Triggered By Status Placement Failures
5744eb15 50 job-register blocked N/A - In Progress
8e38e6cf 50 job-register complete true
Placement Failure
Task Group "cache":
* Resources exhausted on 1 nodes
* Dimension "cpu exhausted" exhausted on 1 nodes
Allocations
ID Eval ID Node ID Task Group Desired Status
12681940 8e38e6cf 4beef22f cache run running
395c5882 8e38e6cf 4beef22f cache run running
4d7c6f84 8e38e6cf 4beef22f cache run running
843b07b8 8e38e6cf 4beef22f cache run running
a8bc6d3e 8e38e6cf 4beef22f cache run running
b0beb907 8e38e6cf 4beef22f cache run running
da21c1fd 8e38e6cf 4beef22f cache run running
```
In the above example we see that the job has a "blocked" evaluation that is in
progress. When Nomad cannot place all the desired allocations, it creates a
blocked evaluation that waits for more resources to become available. We can use
the [`eval-status` command](/docs/commands/eval-status.html) to examine any
evaluation in more detail. For the most part this should never be necessary but
can be useful to see why all of a job's allocations were not placed. For
example, if we run it on the _example_ job, which had a placement failure
according to the above output, we see:
```
$ nomad eval-status 8e38e6cf
ID = 8e38e6cf
Status = complete
Status Description = complete
Type = service
TriggeredBy = job-register
Job ID = example
Priority = 50
Placement Failures = true
Failed Placements
Task Group "cache" (failed to place 3 allocations):
* Resources exhausted on 1 nodes
* Dimension "cpu exhausted" exhausted on 1 nodes
Evaluation "5744eb15" waiting for additional capacity to place remainder
```
More interesting though is the [`alloc-status`
command](/docs/commands/alloc-status.html). This command gives us the most
recent events that occurred for a task, its resource usage, port allocations and
more:
```
$ nomad alloc-status 12
ID = 12681940
Eval ID = 8e38e6cf
Name = example.cache[1]
Node ID = 4beef22f
Job ID = example
Client Status = running
Task "redis" is "running"
Task Resources
CPU Memory Disk IOPS Addresses
2/500 6.3 MiB/256 MiB 300 MiB 0 db: 127.0.0.1:57161
Recent Events:
Time Type Description
06/28/16 15:46:42 UTC Started Task started by client
06/28/16 15:46:10 UTC Restarting Task restarting in 30.863215327s
06/28/16 15:46:10 UTC Terminated Exit Code: 137, Exit Message: "Docker container exited with non-zero exit code: 137"
06/28/16 15:37:46 UTC Started Task started by client
06/28/16 15:37:44 UTC Received Task received by client
```
In the above example we force-killed the Docker container so that we could see
in the event history that Nomad detected the failure and restarted the
allocation.
The `alloc-status` command is a good starting point for debugging an
application that did not start. In this example task we are trying to start a
redis image using `redis:2.8` but the user has accidentally typed a comma
instead of a period: `redis:2,8`.
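For illustration, the relevant portion of the task might look like the
following sketch (based on the `redis` task from the `nomad init` example):
```
task "redis" {
  driver = "docker"

  config {
    # Typo: the tag should be "redis:2.8"
    image = "redis:2,8"
  }
}
```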
When the job is run, it produces an allocation that fails. The `alloc-status`
command gives us the reason why:
```
$ nomad alloc-status c0f1
ID = c0f1b34c
Eval ID = 4df393cb
Name = example.cache[0]
Node ID = 13063955
Job ID = example
Client Status = failed
Task "redis" is "dead"
Task Resources
CPU Memory Disk IOPS Addresses
500 256 MiB 300 MiB 0 db: 127.0.0.1:23285
Recent Events:
Time Type Description
06/28/16 15:50:22 UTC Not Restarting Error was unrecoverable
06/28/16 15:50:22 UTC Driver Failure failed to create image: Failed to pull `redis:2,8`: API error (500): invalid tag format
06/28/16 15:50:22 UTC Received Task received by client
```
Not all failures are this easily debuggable. If the `alloc-status` command shows
many restarts occurring, as in the example below, it is a good hint that the
error is occurring at the application level during start up. These failures can
be debugged by looking at logs, which is covered in the [Nomad Job Logging
documentation](/docs/jobops/logs.html).
```
$ nomad alloc-status e6b6
ID = e6b625a1
Eval ID = 68b742e8
Name = example.cache[0]
Node ID = 83ef596c
Job ID = example
Client Status = pending
Task "redis" is "pending"
Task Resources
CPU Memory Disk IOPS Addresses
500 256 MiB 300 MiB 0 db: 127.0.0.1:30153
Recent Events:
Time Type Description
06/28/16 15:56:16 UTC Restarting Task restarting in 5.178426031s
06/28/16 15:56:16 UTC Terminated Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
06/28/16 15:56:16 UTC Started Task started by client
06/28/16 15:56:00 UTC Restarting Task restarting in 5.00123931s
06/28/16 15:56:00 UTC Terminated Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
06/28/16 15:55:59 UTC Started Task started by client
06/28/16 15:55:48 UTC Received Task received by client
```

View File

@@ -0,0 +1,93 @@
---
layout: "docs"
page_title: "Operating a Job: Accessing Logs"
sidebar_current: "docs-jobops-logs"
description: |-
Learn how to access logs of a Nomad Job.
---
# Accessing Logs
Accessing application logs is critical when debugging issues, performance
problems or even for verifying the application is starting correctly. To make
this as simple as possible, Nomad provides both a CLI tool and an API for
accessing application logs and data files.
To see this in action, we can run the example job created by `nomad
init`:
```
$ nomad init
Example job file written to example.nomad
```
This job will start a redis instance in a Docker container. We can run it now:
```
$ nomad run example.nomad
==> Monitoring evaluation "7a3b78c0"
Evaluation triggered by job "example"
Allocation "c3c58508" created: node "b5320e2d", group "cache"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "7a3b78c0" finished with status "complete"
```
We can grab the allocation ID from above and use the [`nomad fs`
command](/docs/commands/fs.html) to access the application's logs. Logs are
stored under the following directory structure:
`alloc/logs/<task-name>.<stdout/stderr>.<index>`. Nomad has built-in log
rotation, documented in the [Jobspec](/docs/jobspec/index.html#log_rotation).
The index is a monotonically increasing number starting at zero and incremented
each time the log is rotated. Thus, to access `stdout`, we can issue the below
command:
```
$ nomad fs c3c58508 alloc/logs/redis.stdout.0
_._
_.-``__ ''-._
_.-`` `. `_. ''-._ Redis 3.2.1 (00000000/0) 64 bit
.-`` .-```. ```\/ _.,_ ''-._
( ' , .-` | `, ) Running in standalone mode
|`-._`-...-` __...-.``-._|'` _.-'| Port: 6379
| `-._ `._ / _.-' | PID: 1
`-._ `-._ `-./ _.-' _.-'
|`-._`-._ `-.__.-' _.-'_.-'|
| `-._`-._ _.-'_.-' | http://redis.io
`-._ `-._`-.__.-'_.-' _.-'
|`-._`-._ `-.__.-' _.-'_.-'|
| `-._`-._ _.-'_.-' |
`-._ `-._`-.__.-'_.-' _.-'
`-._ `-.__.-' _.-'
`-._ _.-'
`-.__.-'
1:M 28 Jun 19:49:30.504 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:M 28 Jun 19:49:30.505 # Server started, Redis version 3.2.1
1:M 28 Jun 19:49:30.505 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
1:M 28 Jun 19:49:30.505 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
1:M 28 Jun 19:49:30.505 * The server is now ready to accept connections on port 6379
```
Replacing `stdout` with `stderr` would display the respective `stderr` output.
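For example, using the same allocation as above:
```
$ nomad fs c3c58508 alloc/logs/redis.stderr.0
```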
While this works well for quickly accessing logs, we recommend running a
log-shipper for long-term storage of logs. In many cases this will not be
needed and the above will suffice, but for use cases that require log
retention, Nomad can accommodate.
Since we place application logs inside the `alloc/` directory, all tasks within
the same task group have access to each other's logs. Thus we can have a task
group as follows:
```
group "my-group" {
task "log-producer" {...}
task "log-shipper" {...}
}
```
In the above example, the `log-producer` task is the application that produces
the logs we would like to ship, and the `log-shipper` task reads these logs
from the `alloc/logs/` directory and ships them to long-term storage such as
S3.
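Below is a hedged sketch of such a pairing. The `my-app` image and `ship-logs`
binary are hypothetical stand-ins; `NOMAD_ALLOC_DIR` is the environment
variable Nomad sets to the shared allocation directory:
```
group "my-group" {
  task "log-producer" {
    driver = "docker"

    config {
      # Hypothetical application image writing logs to stdout/stderr
      image = "my-app:1.0"
    }
  }

  task "log-shipper" {
    driver = "exec"

    config {
      # Hypothetical shipper reading from the shared log directory
      command = "ship-logs"
      args    = ["-dir", "${NOMAD_ALLOC_DIR}/logs", "-dest", "s3://my-bucket/logs"]
    }
  }
}
```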

View File

@@ -0,0 +1,72 @@
---
layout: "docs"
page_title: "Operating a Job: Resource Utilization"
sidebar_current: "docs-jobops-resource-utilization"
description: |-
Learn how to see resource utilization of a Nomad Job.
---
# Determining Resource Utilization
Understanding the resource utilization of your application is important for many
reasons, and Nomad supports reporting detailed statistics in many of its drivers.
The main interface for seeing resource utilization is the [`alloc-status`
command](/docs/commands/alloc-status.html) with the `-stats` flag.
In the below example we are running `redis` and can see its resource
utilization:
```
$ nomad alloc-status -stats c3e0
ID = c3e0e3e0
Eval ID = 617e5e39
Name = example.cache[0]
Node ID = 39acd6e0
Job ID = example
Client Status = running
Task "redis" is "running"
Task Resources
CPU Memory Disk IOPS Addresses
957/1000 30 MiB/256 MiB 300 MiB 0 db: 127.0.0.1:34907
Memory Stats
Cache Max Usage RSS Swap
32 KiB 79 MiB 30 MiB 0 B
CPU Stats
Percent Throttled Periods Throttled Time
73.66% 0 0
Recent Events:
Time Type Description
06/28/16 16:43:50 UTC Started Task started by client
06/28/16 16:42:42 UTC Received Task received by client
```
Here we can see that we are near the limit of our configured CPU but we have
plenty of memory headroom. We can use this information to alter our job's
resources to better reflect its actual needs:
```
resources {
cpu = 2000
memory = 100
}
```
Adjusting resources is very important for a variety of reasons:
* Ensuring your application does not get OOM killed if it hits its memory limit.
* Ensuring the application performs well by giving it some CPU allowance.
* Optimizing cluster density by reserving what you need and not over-allocating.
While single point in time resource usage measurements are useful, it is often
more useful to graph resource usage over time to better understand and estimate
resource usage. Nomad supports outputting resource data to statsite and statsd,
which is the recommended way of monitoring resources. For more information about
outputting telemetry, see the [Telemetry documentation](/docs/agent/telemetry.html).
For more advanced use cases, the resource usage data may also be accessed via
the client's HTTP API. See the documentation of the Client's
[Allocation HTTP API](/docs/http/client-allocation-stats.html).
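As a sketch, assuming the default HTTP port of `4646`, the stats for the
allocation above could be fetched directly from the client node running it:
```
$ curl http://<client-address>:4646/v1/client/allocation/c3e0e3e0/stats
```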

View File

@@ -0,0 +1,16 @@
---
layout: "docs"
page_title: "Operating a Job: Service Discovery"
sidebar_current: "docs-jobops-service-discovery"
description: |-
Learn how to use service discovery with Nomad Jobs.
---
# Using Service Discovery
Service discovery is key for applications in a dynamic environment to discover
each other. As such, Nomad has built-in support for registering services and
health checks with [Consul](http://consul.io).
For more details on using service discovery with your application, see
the [Service Discovery documentation](/docs/jobspec/servicediscovery.html).
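As a brief sketch, a task registers itself and a health check via a `service`
block like the following (the service name and check timings here are
illustrative):
```
task "redis" {
  ...

  service {
    name = "cache-redis"
    port = "db"

    check {
      type     = "tcp"
      interval = "10s"
      timeout  = "2s"
    }
  }
}
```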

View File

@@ -0,0 +1,103 @@
---
layout: "docs"
page_title: "Operating a Job: Task Configuration"
sidebar_current: "docs-jobops-task-config"
description: |-
Learn how to ship task configuration in a Nomad Job.
---
# Task Configuration
Most tasks need to be parameterized in some way. The simplest way is via
command-line arguments, but oftentimes tasks consume complex configurations via
config files. Here we explore how to configure Nomad jobs to support many
common configuration use cases.
## Command-line Arguments
The simplest type of configuration to support is a task that takes its
configuration via static command-line arguments.
Nomad has many [drivers](/docs/drivers/index.html) and most support passing
arguments to their tasks via the `args` parameter. To configure these simply
provide the appropriate arguments. Below is an example using the [`docker`
driver](/docs/drivers/docker.html) to launch `memcached(8)` and set its thread count
to 4, increase log verbosity, as well as assign the correct port and address
bindings using interpolation:
```
task "memcached" {
driver = "docker"
config {
image = "memcached:1.4.27"
args = [
# Set thread count
"-t", "4",
# Enable the highest verbosity logging mode
"-vvv",
# Use interpolations to limit memory usage and bind
# to the proper address
"-m", "${NOMAD_MEMORY_LIMIT}",
"-p", "${NOMAD_PORT_db}",
"-l", "${NOMAD_ADDR_db}"
]
network_mode = "host"
}
resources {
cpu = 500 # 500 MHz
memory = 256 # 256MB
network {
mbits = 10
port "db" {
}
}
}
}
```
In the above example, we see how easy it is to pass configuration options using
the `args` section, and how
[interpolation](/docs/jobspec/interpreted.html) allows us to pass arguments
based on the dynamic port and address Nomad chose for this task.
## Config Files
Oftentimes applications accept their configuration via configuration files, or
have so many arguments that it would be unwieldy to pass them on the command
line. Nomad supports downloading
[`artifacts`](/docs/jobspec/index.html#artifact_doc) prior to launching tasks.
This allows shipping of configuration files and other assets that the task
needs to run properly.
An example can be seen below, where we download two artifacts: one being the
binary to run and the other being its configuration:
```
task "example" {
driver = "exec"
config {
command = "my-app"
args = ["-config", "local/config.cfg"]
}
# Download the binary to run
artifact {
source = "http://domain.com/example/my-app"
}
# Download the config file
artifact {
source = "http://domain.com/example/config.cfg"
}
}
```
Here we can see a basic example of downloading static configuration files. By
default, an `artifact` is downloaded to the task's `local/` directory but is
[configurable](/docs/jobspec/index.html#artifact_doc).
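Artifacts can also be verified on download. Below is a sketch using the
checksum option supported by `go-getter` (the digest shown is a placeholder):
```
artifact {
  source = "http://domain.com/example/my-app"

  options {
    # Placeholder digest; use the real checksum of your artifact
    checksum = "sha256:abc123..."
  }
}
```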

View File

@@ -0,0 +1,174 @@
---
layout: "docs"
page_title: "Operating a Job: Updating Jobs"
sidebar_current: "docs-jobops-updating"
description: |-
Learn how to safely update Nomad Jobs.
---
# Updating a Job
When operating a service, updating the version of the job will be a common task.
Under a cluster scheduler the same best practices apply for reliably deploying
new versions, including rolling updates, blue-green deploys, and canaries, which
are a special case of blue-green deploys. This section will explore how to do
each of these safely with Nomad.
## Rolling Updates
In order to update a service without introducing downtime, Nomad has built-in
support for rolling updates. When a job specifies a rolling update with the
below syntax, Nomad will only update `max_parallel` task groups at a
time and will wait `stagger` duration before updating the next set.
```
job "rolling" {
...
update {
stagger = "30s"
max_parallel = 1
}
...
}
```
We can use the [`nomad plan` command](/docs/commands/plan.html) while updating
jobs to ensure the scheduler will do as we expect. In this example, we have 3
web server instances whose version we want to update. After modifying the job
file, we can run `plan`:
```
$ nomad plan my-web.nomad
+/- Job: "my-web"
+/- Task Group: "web" (3 create/destroy update)
+/- Task: "web" (forces create/destroy update)
+/- Config {
+/- image: "nginx:1.10" => "nginx:1.11"
port_map[0][http]: "80"
}
Scheduler dry-run:
- All tasks successfully allocated.
- Rolling update, next evaluation will be in 10s.
Job Modify Index: 7
To submit the job with version verification run:
nomad run -check-index 7 my-web.nomad
When running the job with the check-index flag, the job will only be run if the
server side version matches the job modify index returned. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
```
Here we can see that Nomad will destroy the 3 existing tasks and create 3
replacements, but it will do so via a rolling update with a stagger of `10s`.
For more details on the update block, see
the [Jobspec documentation](/docs/jobspec/index.html#update).
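Once the plan looks correct, we can submit the job with the version check the
plan output suggested. If another operator modified the job in the meantime,
the submission fails rather than clobbering their change:
```
$ nomad run -check-index 7 my-web.nomad
```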
## Blue-green and Canaries
Blue-green deploys go by several names (Red/Black, A/B, Blue/Green), but the
concept is the same. The idea is to have two sets of applications with only one
of them being live at a given time, except while transitioning from one set to
another. The "live" set is the set of applications currently receiving
traffic.
So imagine we have an API server that has 10 instances deployed to production
at version 1 and we want to upgrade to version 2. Hopefully the new version has
been tested in a QA environment and is now ready to start accepting production
traffic.
In this case we would consider version 1 to be the live set and we want to
transition to version 2. We can model this workflow with the below job:
```
job "my-api" {
...
group "api-green" {
count = 10
task "api-server" {
driver = "docker"
config {
image = "api-server:v1"
}
}
}
group "api-blue" {
count = 0
task "api-server" {
driver = "docker"
config {
image = "api-server:v2"
}
}
}
}
```
Here we can see the live group is "api-green" since it has a non-zero count. To
transition to v2, we up the count of "api-blue" and down the count of
"api-green". We can now see how the canary process is a special case of
blue-green. If we set "api-blue" to `count = 1` and "api-green" to `count = 9`,
there will still be the original 10 instances but we will be testing only one
instance of the new version, essentially canarying it, as sketched below.
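As a sketch, the canary step just described would look like:
```
group "api-green" {
  count = 9
  ...
}

group "api-blue" {
  count = 1
  ...
}
```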
If at any time we notice that the new version is behaving incorrectly and we
want to roll back, all that we have to do is drop the count of the new group to
0 and restore the original version back to 10. This fine-grained control lets
job operators be confident that deployments will not cause downtime. If the
deploy is successful and we fully transition from v1 to v2, the job file will
look like this:
```
job "my-api" {
...
group "api-green" {
count = 0
task "api-server" {
driver = "docker"
config {
image = "api-server:v1"
}
}
}
group "api-blue" {
count = 10
task "api-server" {
driver = "docker"
config {
image = "api-server:v2"
}
}
}
}
```
Now "api-blue" is the live group and when we are ready to update the api to v3,
we would modify "api-green" and repeat this process. The rate at which the count
of groups are incremented and decremented is totally up to the user. It is
usually good practice to start by transistion one at a time until a certain
confidence threshold is met based on application specific logs and metrics.
## Handling Drain Signals
On operating systems that support signals, Nomad will signal the application
before killing it. This gives the application time to gracefully drain
connections and conduct any other cleanup that is necessary. Certain
applications take longer to drain than others and as such Nomad lets the job
file specify how long to wait between signaling the application to exit and
forcefully killing it. This is configurable via the `kill_timeout`. More details
can be seen in the [Jobspec documentation](/docs/jobspec/index.html#kill_timeout).
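A brief sketch of giving a task a longer drain window (the `45s` value is
illustrative):
```
task "api-server" {
  driver       = "docker"
  kill_timeout = "45s"

  config {
    image = "api-server:v2"
  }
}
```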

View File

@@ -150,6 +150,8 @@ The `job` object supports the following keys:
and defaults to `service`. To learn more about each scheduler type visit
[here](/docs/jobspec/schedulers.html)
<a id="update"></a>
* `update` - Specifies the job's update strategy. When omitted, rolling
updates are disabled. The `update` block supports the following keys:
@@ -266,12 +268,16 @@ The `task` object supports the following keys:
* `meta` - Annotates the task group with opaque metadata.
<a id="kill_timeout"></a>
* `kill_timeout` - `kill_timeout` is a time duration that can be specified using
the `s`, `m`, and `h` suffixes, such as `30s`. It can be used to configure the
time between signaling a task it will be killed and actually killing it. Nomad
sends an `os.Interrupt`, which on Unix systems is defined as `SIGINT`. After
the timeout a kill signal is sent (on Unix, `SIGKILL`).
* `logs` - Logs allows configuring log rotation for the `stdout` and `stderr`
buffers of a Task. See the [log rotation section](#log_rotation) for more details.
* `artifact` - Defines an artifact to be downloaded before the task is run. This
can be provided multiple times to define additional artifacts to download. See
@@ -389,6 +395,8 @@ The `constraint` object supports the following keys:
redundant since when placed at the job level, the constraint will be applied
to all task groups.
<a id="log_rotation"></a>
### Log Rotation
The `logs` object configures the log rotation policy for a task's `stdout` and
@@ -415,10 +423,10 @@ In the above example we have asked Nomad to retain 3 rotated files for both
`stderr` and `stdout` and size of each file is 10MB. The minimum disk space that
would be required for the task would be 60MB.
<a id="artifact_doc"></a>
### Artifact
Nomad downloads artifacts using
[`go-getter`](https://github.com/hashicorp/go-getter). The `go-getter` library
allows downloading of artifacts from various sources using a URL as the input

View File

@@ -35,6 +35,30 @@
<a href="/docs/cluster/bootstrapping.html">Creating a Cluster</a>
</li>
<li<%= sidebar_current("docs-jobops") %>>
<a href="/docs/jobops/index.html">Operating a Job</a>
<ul class="nav">
<li<%= sidebar_current("docs-jobops-task-config") %>>
<a href="/docs/jobops/taskconfig.html">Task Configuration</a>
</li>
<li<%= sidebar_current("docs-jobops-inspection") %>>
<a href="/docs/jobops/inspecting.html">Inspecting State</a>
</li>
<li<%= sidebar_current("docs-jobops-resource-utilization") %>>
<a href="/docs/jobops/resources.html">Resource Utilization</a>
</li>
<li<%= sidebar_current("docs-jobops-service-discovery") %>>
<a href="/docs/jobops/servicediscovery.html">Service Discovery</a>
</li>
<li<%= sidebar_current("docs-jobops-logs") %>>
<a href="/docs/jobops/logs.html">Accessing Logs</a>
</li>
<li<%= sidebar_current("docs-jobops-updating") %>>
<a href="/docs/jobops/updating.html">Updating Jobs</a>
</li>
</ul>
</li>
<li<%= sidebar_current("docs-upgrade") %>>
<a href="/docs/upgrade/index.html">Upgrading</a>
<ul class="nav">