nomad

mirror of https://github.com/kemko/nomad.git synced 2026-01-04 01:15:43 +03:00

Author	SHA1	Message	Date
Seth Hoenig	f0f6f3a18f	consul/connect: fix regression where client connect images ignored Nomad v1.0.0 introduced a regression where the client configurations for `connect.sidecar_image` and `connect.gateway_image` would be ignored despite being set. This PR restores that functionality. There was a missing layer of interpolation that needs to occur for these parameters. Since Nomad 1.0 now supports dynamic envoy versioning through the ${NOMAD_envoy_version} pseudo variable, we basically need to first interpolate ${connect.sidecar_image} => envoyproxy/envoy:v${NOMAD_envoy_version} then use Consul at runtime to resolve to a real image, e.g. envoyproxy/envoy:v${NOMAD_envoy_version} => envoyproxy/envoy:v1.16.0 Of course, if the version of Consul is too old to provide an envoy version preference, we then need to know to fall back to the old version of envoy that we used before. envoyproxy/envoy:v${NOMAD_envoy_version} => envoyproxy/envoy:v1.11.2@sha256:a7769160c9c1a55bb8d07a3b71ce5d64f72b1f665f10d81aa1581bc3cf850d09 Beyond that, we also need to continue to support jobs that set the sidecar task themselves, e.g. sidecar_task { config { image: "custom/envoy" } } which itself could include the pseudo envoy version variable.	2020-12-14 09:47:55 -06:00
Kris Hicks	7747124ef0	Apply some suggested fixes from staticcheck (#9598 )	2020-12-10 07:29:18 -08:00
Kris Hicks	74cb28ec30	pluginmanager: WaitForFirstFingerprint times out (#9597 ) As pointed out by @tgross[1], prior to this change we would have been blocking until all managers waited for first fingerprint rather than timing out as intended. 1: https://github.com/hashicorp/nomad/pull/9590#discussion_r539534906	2020-12-10 07:27:15 -08:00
Seth Hoenig	b38ccaac3d	Merge pull request #9586 from hashicorp/f-connect-interp consul/connect: interpolate connect block	2020-12-09 13:21:50 -06:00
Kris Hicks	85ed8ddd4f	Add gosimple linter (#9590 )	2020-12-09 11:05:18 -08:00
Seth Hoenig	edf1e25d30	consul/connect: avoid extra copy of connect stanza while interpolating	2020-12-09 11:44:07 -06:00
Seth Hoenig	da1235f35b	client/fingerprint/cpu: use fallback total compute value if cpu not detected Previously, Nomad would fail to start up if the CPU fingerprinter could not detect the cpu total compute (i.e. cores * mhz). This is common on some EC2 instance types (graviton class), where the env_aws fingerprinter will override the detected CPU performance with a more accurate value anyway. Instead of crashing on startup, have Nomad use a low default for available cpu performance of 1000 ticks (e.g. 1 core * 1 GHz). This enables Nomad to get past the useless cpu fingerprinting on those EC2 instances. The crashing error message is now a log statement suggesting the setting of cpu_total_compute in client config. Fixes #7989	2020-12-09 10:35:58 -06:00
Seth Hoenig	4d0e74585a	consul/connect: interpolate connect block This PR enables job submitters to use interpolation in the connect block of jobs making use of consul connect. Before, only the name of the connect service would be interpolated, and only for a few select identifiers related to the job itself (#6853). Now, all connect fields can be interpolated using the full spectrum of runtime parameters. Note that the service name is interpolated at job-submission time, and cannot make use of values known only at runtime. Fixes #7221	2020-12-09 09:10:00 -06:00
Kris Hicks	071f4c7596	Add gocritic to golangci-lint config (#9556 )	2020-12-08 12:47:04 -08:00
Seth Hoenig	6c8ea087d6	env_aws: run ec2info to update ec2 info Use `tools/ec2info` to update the generated table of instance types. `$ go run .`	2020-12-02 09:35:03 -06:00
Seth Hoenig	4d6a166989	Merge pull request #9487 from hashicorp/f-connect-sidecar-concurrency consul/connect: default envoy concurrency to 1	2020-12-01 15:51:41 -06:00
Seth Hoenig	2a397dbda2	consul/connect: default envoy concurrency to 1 Previously, every Envoy Connect sidecar would spawn as many worker threads as logical CPU cores. That is Envoy's default behavior when `--concurrency` is not explicitly set. Nomad now sets the concurrency flag to 1, which is sensible for the default cpu = 250 Mhz resources allocated for sidecar proxies. The concurrency value can be configured in Client configuration by setting `meta.connect.proxy_concurrency`. Closes #9341	2020-12-01 13:12:45 -06:00
Michael Schurter	c60d9a98a5	Merge pull request #9435 from hashicorp/f-allocupdate-timer client: always wait 200ms before sending updates	2020-12-01 08:45:17 -08:00
Drew Bailey	61ce743228	Event Stream: Track ACL changes, unsubscribe on invalidating changes (#9447 ) * upsertaclpolicies * delete acl policies msgtype * upsert acl policies msgtype * delete acl tokens msgtype * acl bootstrap msgtype wip unsubscribe on token delete test that subscriptions are closed after an ACL token has been deleted Start writing policyupdated test * update test to use before/after policy * add SubscribeWithACLCheck to run acl checks on subscribe * update rpc endpoint to use broker acl check * Add and use subscriptions.closeSubscriptionFunc This fixes the issue of not being able to defer unlocking the mutex on the event broker in the for loop. handle acl policy updates * rpc endpoint test for terminating acl change * add comments Co-authored-by: Kris Hicks <khicks@hashicorp.com>	2020-12-01 11:11:34 -05:00
Benjamin Buzbee	6a6547b0b6	Fix RPC retry logic in nomad client's rpc.go for blocking queries (#9266 )	2020-11-30 15:11:10 -05:00
Roman Vynar	4bbe50bc53	Add compute/zone to Azure fingerprinting	2020-11-26 13:26:51 +02:00
Michael Schurter	e6fd2583fa	client: always wait 200ms before sending updates Always wait 200ms before calling the Node.UpdateAlloc RPC to send allocation updates to servers. Prior to this change we only reset the update ticker when an error was encountered. This meant the 200ms ticker was running while the RPC was being performed. If the RPC was slow due to network latency or server load and took >=200ms, the ticker would tick during the RPC. Then on the next loop only the select would randomly choose between the two viable cases: receive an update or fire the RPC again. If the RPC case won it would immediately loop again due to there being no updates to send. When the update chan receive is selected a single update is added to the slice. The odds are then 50/50 that the subsequent loop will send the single update instead of receiving any more updates. This could cause a couple of problems: 1. Since only a small number of updates are sent, the chan buffer may fill, applying backpressure, and slowing down other client operations. 2. The small number of updates sent may already be stale and not represent the current state of the allocation locally. A risk here is that it's hard to reason about how this will interact with the 50ms batches on servers when the servers are under load. A further improvement would be to completely remove the alloc update chan and instead use a mutex to build a map of alloc updates. I wanted to test the lowest risk possible change on loaded servers first before making more drastic changes.	2020-11-25 11:36:51 -08:00
Michael Schurter	5b83ca0b5d	client: skip broken test and fix assertion	2020-11-18 10:01:02 -08:00
Michael Schurter	cd7226d398	client: fix interpolation in template source While Nomad v0.12.8 fixed `NOMAD_{ALLOC,TASK,SECRETS}_DIR` use in `template.destination`, interpolating these variables in `template.source` caused a path escape error. Why not apply the destination fix to source? The destination fix forces destination to always be relative to the task directory. This makes sense for the destination as a destination outside the task directory would be unreachable by the task. There's no reason to ever render a template outside the task directory. (Using `..` does allow destinations to escape the task directory if `template.disable_file_sandbox = true`. That's just awkward and unsafe enough I hope no one uses it.) There is a reason to source a template outside a task directory. At least if there weren't then I can't think of why we implemented `template.disable_file_sandbox`. So v0.12.8 left the behavior of `template.source` the more straightforward "Interpolate and validate." However, since outside of `raw_exec` every other driver uses absolute paths for `NOMAD_*_DIR` interpolation, this means those variables are unusable unless `disable_file_sandbox` is set. The Fix The variables are now interpolated as relative paths only for the purpose of rendering templates. This is an unfortunate special case, but reflects the fact that the template's view of the filesystem is completely different (unconstrained) vs the task's view (chrooted). Arguably the values of these variables should be context-specific. I think it's more reasonable to think of the "hack" as templating running uncontainerized than that giving templates different paths is a hack. TODO - [ ] E2E tests - [ ] Job validation may still be broken and prevent my fix from working? raw_exec `raw_exec` is actually broken _a different way_ as exercised by tests in this commit. I think we should probably remove these tests and fix that in a followup PR/release, but I wanted to leave them in for the initial review and discussion. Since non-containerized source paths are broken anyway, perhaps there's another solution to this entire problem I'm overlooking?	2020-11-17 22:03:04 -08:00
Wim	19934d35dc	Use correct interface for netStatus CNI plugins can return multiple interfaces, eg the bridge plugin. We need the interface with the sandbox.	2020-11-14 22:29:30 +01:00
Seth Hoenig	459112b41d	Merge pull request #9352 from hashicorp/f-artifact-headers jobspec: add support for headers in artifact stanza	2020-11-13 14:04:27 -06:00
Seth Hoenig	6c7578636c	jobspec: add support for headers in artifact stanza This PR adds the ability to set HTTP headers when downloading an artifact from an `http` or `https` resource. The implementation in `go-getter` is such that a new `HTTPGetter` must be created for each artifact that sets headers (as opposed to conveniently setting headers per-request). This PR maintains the memoization of the default Getter objects, creating new ones only for artifacts where headers are set. Closes #9306	2020-11-13 12:03:54 -06:00
Jasmine Dahilig	b85cce42fe	lifecycle: add poststop hook (#8194 )	2020-11-12 08:01:42 -08:00
Chris Baker	1df408dfb2	Merge pull request #9311 from jeromegn/allow-empty-devices Don't ignore nil devices in plugin fingerprint	2020-11-11 13:54:03 -06:00
Tim Gross	0ed0b945c9	csi: Postrun hook should not change mode (#9323 ) The unpublish workflow requires that we know the mode (RW vs RO) if we want to unpublish the node. Update the hook and the Unpublish RPC so that we mark the claim for release in a new state but leave the mode alone. This fixes a bug where RO claims were failing node unpublish. The core job GC doesn't know the mode, but we don't need it for that workflow, so add a mode specifically for GC; the volumewatcher uses this as a sentinel to check whether claims (with their specific RW vs RO modes) need to be claimed.	2020-11-11 13:06:30 -05:00
Jerome Gravel-Niquet	66ddf62931	Don't ignore nil devices in plugin fingerprint Even if a plugin sends back an empty `[]device.DeviceGroup`, it's transformed to `nil` during the RPC. Our custom device plugin is returning empty `FingerprintResponse.Devices` very often. Our temporary fix is to send a dummy `DeviceGroup` if the slice is empty. This has the effect of never triggering the "first fingerprint" and therefore timing out after 50s. In turn, this made our node exceed its heartbeat grace period when restarting it, revoking all vault tokens for its allocations, causing a restart of all our allocations because the token couldn't be renewed. Removing the logic for `f.Devices == nil` does not appear to affect the functionality of the function.	2020-11-10 16:04:22 -05:00
Seth Hoenig	52cef27176	client/fingerprint: detect unloaded dynamic bridge kernel module In Nomad v0.12.0, the client added additional fingerprinting around the presence of the bridge kernel module. The fingerprinter only checked in `/proc/modules` which is a list of loaded modules. In some cases, the bridge kernel module is builtin rather than dynamically loaded. The fix for that case is in #8721. However we were still missing the case where the bridge module is dynamically loaded, but not yet loaded during the startup of the Nomad agent. In this case the fingerprinter would believe the bridge module was unavailable when really it gets loaded on demand. This PR now has the fingerprinter scan the kernel module dependency file, which will contain an entry for the bridge module even if it is not yet loaded. In summary, the client now looks for the bridge kernel module in - /proc/modules - /lib/modules/<kernel>/modules.builtin - /lib/modules/<kernel>/modules.dep Closes #8423	2020-11-09 13:56:14 -06:00
Nick Ethier	60838c94f8	ar/groupservice: remove drivernetwork (#9233 ) * ar/groupservice: remove drivernetwork * consul: allow host address_mode to accept raw port numbers * consul: fix logic for blank address	2020-11-05 15:00:22 -05:00
Stefan Richter	55d00d77ae	Add NOMAD_JOB_ID and NOMAD_JOB_PARENT_ID env variables (#8967 ) Beforehand tasks and field replacements did not have access to the unique ID of their job or its parent. This adds this information as new environment variables.	2020-10-23 10:49:58 -04:00
Tim Gross	8a90b7eb16	artifact/template: make destination path absolute inside taskdir (#9149 ) Prior to Nomad 0.12.5, you could use `${NOMAD_SECRETS_DIR}/mysecret.txt` as the `artifact.destination` and `template.destination` because we would always append the destination to the task working directory. In the recent security patch we treated the `destination` absolute path as valid if it didn't escape the working directory, but this breaks backwards compatibility and interpolation of `destination` fields. This changeset partially reverts the behavior so that we always append the destination, but we also perform the escape check on that new destination after interpolation so the security hole is closed. Also, ConsulTemplate test should exercise interpolation	2020-10-22 15:47:49 -04:00
Tim Gross	076db2ef6b	artifact/template: prevent file sandbox escapes Ensure that the client honors the client configuration for the `template.disable_file_sandbox` field when validating the jobspec's `template.source` parameter, and not just with consul-template's own `file` function. Prevent interpolated `template.source`, `template.destination`, and `artifact.destination` fields from escaping file sandbox.	2020-10-21 14:34:12 -04:00
Alexander Shtuchkin	1be5243d08	Implement 'batch mode' for persisting allocations on the client. (#9093 ) Fixes #9047, see problem details there. As a solution, we use BoltDB's 'Batch' mode that combines multiple parallel writes into small number of transactions. See https://github.com/boltdb/bolt#batch-read-write-transactions for more information.	2020-10-20 16:15:37 -04:00
Seth Hoenig	3b55c2fc01	client: add tests around meta and canarymeta interpolation Expanding on #9096, add tests for making sure service.Meta and service.CanaryMeta are interpolated from environment variables.	2020-10-20 12:50:29 -05:00
Jorge Marey	bb8f239fc7	Add interpolation on service canarymeta	2020-10-20 12:45:36 -05:00
Drew Bailey	7ce0b5017c	Events/msgtype cleanup (#9117 ) * use msgtype in upsert node adds message type to signature for upsert node, update tests, remove placeholder method * UpsertAllocs msg type test setup * use upsertallocs with msg type in signature update test usage of delete node delete placeholder msgtype method * add msgtype to upsert evals signature, update test call sites with test setup msg type handle snapshot upsert eval outside of FSM and ignore eval event remove placeholder upsertevalsmsgtype handle job plan rpc and prevent event creation for plan msgtype cleanup upsertnodeevents updatenodedrain msgtype msg type 0 is a node registration event, so set the default to the ignore type * fix named import * fix signature ordering on upsertnode to match	2020-10-19 09:30:15 -04:00
Nick Ethier	7b50685cf7	Consul with CNI and host_network addresses (#9095 ) * consul: advertise cni and multi host interface addresses * structs: add service/check address_mode validation * ar/groupservices: fetch networkstatus at hook runtime * ar/groupservice: nil check network status getter before calling * consul: comment network status can be nil	2020-10-15 15:32:21 -04:00
Michael Schurter	f44c04ecd1	s/0.13/1.0/g 1.0 here we come!	2020-10-14 15:17:47 -07:00
Chris Baker	797543ad4b	removed backwards-compatible/untagged metrics deprecated in 0.7	2020-10-13 20:18:39 +00:00
Seth Hoenig	bdeb73cd2c	consul/connect: dynamically select envoy sidecar at runtime As newer versions of Consul are released, the minimum version of Envoy it supports as a sidecar proxy also gets bumped. Starting with the upcoming Consul v1.9.X series, Envoy v1.11.X will no longer be supported. Current versions of Nomad hardcode a version of Envoy v1.11.2 to be used as the default implementation of Connect sidecar proxy. This PR introduces a change such that each Nomad Client will query its local Consul for a list of Envoy proxies that it supports (https://github.com/hashicorp/consul/pull/8545) and then launch the Connect sidecar proxy task using the latest supported version of Envoy. If the `SupportedProxies` API component is not available from Consul, Nomad will fall back to the old version of Envoy supported by old versions of Consul. Setting the meta configuration option `meta.connect.sidecar_image` or setting the `connect.sidecar_task` stanza will take precedence as is the current behavior for sidecar proxies. Setting the meta configuration option `meta.connect.gateway_image` will take precedence as is the current behavior for connect gateways. `meta.connect.sidecar_image` and `meta.connect.gateway_image` may make use of the special `${NOMAD_envoy_version}` variable interpolation, which resolves to the newest version of Envoy supported by the Consul agent. Addresses #8585 #7665	2020-10-13 09:14:12 -05:00
Seth Hoenig	d3a51279af	Merge pull request #9038 from hashicorp/f-ec2-table env_aws: get ec2 cpu perf data from AWS API	2020-10-12 18:55:33 -05:00
Nick Ethier	756aa11654	client: add NetworkStatus to Allocation (#8657 )	2020-10-12 13:43:04 -04:00
Yoan Blanc	c14c616194	use allow/deny instead of the colored alternatives (#9019 ) Signed-off-by: Yoan Blanc <yoan@dosimple.ch>	2020-10-12 08:47:05 -04:00
Tim Gross	1ce58e8000	csi: fix incorrect comment on csi_hook context lifetime	2020-10-09 11:03:51 -04:00
Seth Hoenig	080b2c4415	env_aws: fixup test case node attr detection	2020-10-08 12:59:07 -05:00
Seth Hoenig	53ab30870b	env_aws: get ec2 cpu perf data from AWS API Previously, Nomad was using a hand-made lookup table for looking up EC2 CPU performance characteristics (core count + speed = ticks). This data was incomplete and incorrect depending on region. The AWS API has the correct data but requires API keys to use (i.e. should not be queried directly from Nomad). This change introduces a lookup table generated by a small command line tool in Nomad's tools module which uses the Amazon AWS API. Running the tool requires AWS_* environment variables set. $ # in nomad/tools/cpuinfo $ go run . Going forward, Nomad can incorporate regeneration of the lookup table somewhere in the CI pipeline so that we remain up-to-date on the latest offerings from EC2. Fixes #7830	2020-10-08 12:01:09 -05:00
Landan Cheruka	e40b9a40b2	fingerprint: changed unique.platform.azure.hostname to unique.platform.azure.name (#9016 )	2020-10-02 16:50:12 -04:00
Javier Heredia	5fd9f1b5f5	Add consul segment fingerprint (#7214 )	2020-10-02 15:15:59 -04:00
Fredrik Hoem Grelland	eb7cc6425d	configure nomad cluster to use a Consul Namespace [Consul Enterprise] (#8849 )	2020-10-02 14:46:36 -04:00
Fredrik Hoem Grelland	8238b9f864	update consul-template to v0.25.1 (#8988 )	2020-10-01 14:08:49 -04:00
Landan Cheruka	89558ead34	client: added azure fingerprinting support (#8979 )	2020-10-01 09:10:27 -04:00
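
The two-stage image resolution described in commit f0f6f3a18f (interpolate `${connect.sidecar_image}`, then resolve `${NOMAD_envoy_version}` from Consul, else fall back to the pinned image) can be sketched roughly as below. This is a minimal illustration and not Nomad's actual code: the function `resolveImage`, its parameters, and the constant names are hypothetical, and returning the pinned fallback image whenever Consul reports no version is an assumption of the sketch. Only the image strings themselves come from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// Names below are hypothetical; only the image strings are taken from the commit message.
const (
	versionVar      = "${NOMAD_envoy_version}"
	sidecarImageVar = "${connect.sidecar_image}"
	defaultImage    = "envoyproxy/envoy:v" + versionVar
	fallbackImage   = "envoyproxy/envoy:v1.11.2@sha256:a7769160c9c1a55bb8d07a3b71ce5d64f72b1f665f10d81aa1581bc3cf850d09"
)

// resolveImage is an illustrative helper, not a Nomad API.
//   taskImage:          image string set on the sidecar task (may contain either variable)
//   clientImage:        client-configured sidecar image ("" means use the default)
//   consulEnvoyVersion: Envoy version preferred by Consul ("" means Consul is too old to say)
func resolveImage(taskImage, clientImage, consulEnvoyVersion string) string {
	if clientImage == "" {
		clientImage = defaultImage
	}
	// Stage 1: interpolate the client configuration into the task image.
	image := strings.ReplaceAll(taskImage, sidecarImageVar, clientImage)

	// Images without the version variable (e.g. "custom/envoy") are used as-is.
	if !strings.Contains(image, versionVar) {
		return image
	}
	// Assumption of this sketch: with no version from Consul, use the pinned old image.
	if consulEnvoyVersion == "" {
		return fallbackImage
	}
	// Stage 2: resolve the pseudo variable to the version reported by Consul.
	return strings.ReplaceAll(image, versionVar, consulEnvoyVersion)
}

func main() {
	fmt.Println(resolveImage(sidecarImageVar, "", "1.16.0")) // default image, new Consul
	fmt.Println(resolveImage(sidecarImageVar, "", ""))       // default image, old Consul
	fmt.Println(resolveImage("custom/envoy", "", "1.16.0"))  // job sets its own image
}
```

Keeping the two substitutions as separate steps mirrors the "missing layer of interpolation" the commit message refers to: the client setting may itself contain the version variable, so it has to be expanded before the version is resolved.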


4283 Commits