Commit Graph

51 Commits

Author SHA1 Message Date
Jorge Marey
25426f0777 fingerprint: add config option to disable dmidecode (#25108) 2025-02-13 11:20:48 -05:00
Piotr Kazmierczak
0bc9796d3b client: log an error message if total detected cpu is zero (#23827) 2024-08-15 18:31:27 +02:00
Tim Gross
7d73065066 numa: fix scheduler panic due to topology serialization bug (#23284)
The NUMA topology struct field `NodeIDs` is a `idset.Set`, which has no public
members. As a result, this field is never serialized via msgpack and persisted
in state. When `numa.affinity = "prefer"`, the scheduler dereferences this nil
field and panics the scheduler worker.

Ideally we would fix this by adding a msgpack serialization extension, but
because the field already exists and is just always empty, this breaks RPC wire
compatibility across upgrades. Instead, create a new field that's populated at
the same time we populate the more useful `idset.Set`, and repopulate the set on
demand.

Fixes: https://hashicorp.atlassian.net/browse/NET-9924
2024-06-11 08:55:00 -04:00
Seth Hoenig
83720740f5 core: plumbing to support numa aware scheduling (#18681)
* core: plumbing to support numa aware scheduling

* core: apply node resources compatibility upon fsm rstore

Handle the case where an upgraded server dequeus an evaluation before
a client triggers a new fingerprint - which would be needed to cause
the compatibility fix to run. By running the compat fix on restore the
server will immediately have the compatible pseudo topology to use.

* lint: learn how to spell pseudo
2023-10-19 15:09:30 -05:00
Seth Hoenig
591394fb62 drivers: plumb hardware topology via grpc into drivers (#18504)
* drivers: plumb hardware topology via grpc into drivers

This PR swaps out the temporary use of detecting system hardware manually
in each driver for using the Client's detected topology by plumbing the
data over gRPC. This ensures that Client configuration is taken to account
consistently in all references to system topology.

* cr: use enum instead of bool for core grade

* cr: fix test slit tables to be possible
2023-09-18 08:58:07 -05:00
Seth Hoenig
2e1974a574 client: refactor cpuset partitioning (#18371)
* client: refactor cpuset partitioning

This PR updates the way Nomad client manages the split between tasks
that make use of resources.cpus vs. resources.cores.

Previously, each task was explicitly assigned which CPU cores they were
able to run on. Every time a task was started or destroyed, all other
tasks' cpusets would need to be updated. This was inefficient and would
crush the Linux kernel when a client would try to run ~400 or so tasks.

Now, we make use of cgroup heirarchy and cpuset inheritence to efficiently
manage cpusets.

* cr: tweaks for feedback
2023-09-12 09:11:11 -05:00
hashicorp-copywrite[bot]
2d35e32ec9 Update copyright file headers to BUSL-1.1 2023-08-10 17:27:15 -05:00
Seth Hoenig
a4cc76bd3e numa: enable numa topology detection (#18146)
* client: refactor cgroups management in client

* client: fingerprint numa topology

* client: plumb numa and cgroups changes to drivers

* client: cleanup task resource accounting

* client: numa client and config plumbing

* lib: add a stack implementation

* tools: remove ec2info tool

* plugins: fixup testing for cgroups / numa changes

* build: update makefile and package tests and cl
2023-08-10 17:05:30 -05:00
Patric Stout
e190eae395 Use config "cpu_total_compute" (if set) for all CPU statistics (#17628)
Before this commit, it was only used for fingerprinting, but not
for CPU stats on nodes or tasks. This meant that if the
auto-detection failed, setting the cpu_total_compute didn't resolved
the issue.

This issue was most noticeable on ARM64, as there auto-detection
always failed.
2023-07-19 13:30:47 -05:00
hashicorp-copywrite[bot]
f005448366 [COMPLIANCE] Add Copyright and License Headers 2023-04-10 15:36:59 +00:00
Seth Hoenig
fd900d0723 client/fingerprint: correctly fingerprint E/P cores of Apple Silicon chips (#16672)
* client/fingerprint: correctly fingerprint E/P cores of Apple Silicon chips

This PR adds detection of asymetric core types (Power & Efficiency) (P/E)
when running on M1/M2 Apple Silicon CPUs. This functionality is provided
by shoenig/go-m1cpu which makes use of the Apple IOKit framework to read
undocumented registers containing CPU performance data. Currently working
on getting that functionality merged upstream into gopsutil, but gopsutil
would still not support detecting P vs E cores like this PR does.

Also refactors the CPUFingerprinter code to handle the mixed core
types, now setting power vs efficiency cpu attributes.

For now the scheduler is still unaware of mixed core types - on Apple
platforms tasks cannot reserve cores anyway so it doesn't matter, but
at least now the total CPU shares available will be correct.

Future work should include adding support for detecting P/E cores on
the latest and upcoming Intel chips, where computation of total cpu shares
is currently incorrect. For that, we should also include updating the
scheduler to be core-type aware, so that tasks of resources.cores on Linux
platforms can be assigned the correct number of CPU shares for the core
type(s) they have been assigned.

node attributes before

cpu.arch                  = arm64
cpu.modelname             = Apple M2 Pro
cpu.numcores              = 12
cpu.reservablecores       = 0
cpu.totalcompute          = 1000

node attributes after

cpu.arch                  = arm64
cpu.frequency.efficiency  = 2424
cpu.frequency.power       = 3504
cpu.modelname             = Apple M2 Pro
cpu.numcores.efficiency   = 4
cpu.numcores.power        = 8
cpu.reservablecores       = 0
cpu.totalcompute          = 37728

* fingerprint/cpu: follow up cr items
2023-03-28 08:27:58 -05:00
Michael Schurter
2e059c624f fingerprint: add node attr for reserverable cores (#14694)
* fingerprint: add node attr for reserverable cores

Add an attribute for the number of reservable CPU cores as they may
differ from the existing `cpu.numcores` due to client configuration or
OS support.

Hopefully clarifies some confusion in #14676

* add changelog

* num_reservable_cores -> reservablecores
2022-09-26 13:03:03 -07:00
Nick Ethier
f897ac79e8 client/ar: thread through cpuset manager 2021-04-13 13:28:36 -04:00
Nick Ethier
03d6eb8205 client: only fingerprint reservable cores via cgroups, allowing manual override for other platforms 2021-04-13 13:28:15 -04:00
Nick Ethier
b8397a712d fingerprint: implement client fingerprinting of reservable cores
on Linux systems this is derived from the configure cpuset cgroup parent (defaults to /nomad)
for non Linux systems and Linux systems where cgroups are not enabled, the client defaults to using all cores
2021-04-13 13:28:15 -04:00
Joel May
2e17610406 Allow client.cpu_total_compute to override attr.cpu.totalcompute 2021-01-07 15:31:11 -05:00
Seth Hoenig
da1235f35b client/fingerprint/cpu: use fallback total compute value if cpu not detected
Previously, Nomad would fail to startup if the CPU fingerprinter could
not detect the cpu total compute (i.e. cores * mhz). This is common on
some EC2 instance types (graviton class), where the env_aws fingerprinter
will override the detected CPU performance with a more accurate value
anyway.

Instead of crashing on startup, have Nomad use a low default for available
cpu performance of 1000 ticks (e.g. 1 core * 1 GHz). This enables Nomad
to get past the useless cpu fingerprinting on those EC2 instances. The
crashing error message is now a log statement suggesting the setting of
cpu_total_compute in client config.

Fixes #7989
2020-12-09 10:35:58 -06:00
Danielle Tomlinson
da48a7eab3 client: Move fingerprint structs to pkg
This removes a cyclical dependency when importing client/structs from
dependencies of the plugin_loader, specifically, drivers. Due to
client/config also depending on the plugin_loader.

It also better reflects the ownership of fingerprint structs, as they
are fairly internal to the fingerprint manager.
2018-12-01 17:10:39 +01:00
Alex Dadgar
9a2c2a4f68 client uses passed logger and fix fingerprinters 2018-10-16 16:53:30 -07:00
Alex Dadgar
5e67b37aad use int64 2018-10-16 15:34:32 -07:00
Preetha Appan
3ca71ae935 Change CPU/Disk/MemoryMB to int everywhere in new resource structs 2018-10-16 16:21:42 -05:00
Alex Dadgar
e30b20e65e renames 2018-10-04 14:57:25 -07:00
Alex Dadgar
b310a54aa6 Node resources on client 2018-09-29 17:23:41 -07:00
Chelsea Holland Komlo
ba2ebbc7f9 code review fixup 2018-01-31 18:34:03 -05:00
Chelsea Holland Komlo
a9447addd3 add applicable boolean to fingerprint response
public fields and remove getter functions
2018-01-31 13:21:45 -05:00
Chelsea Holland Komlo
f5fc20a564 create safe getters and setters for fingerprint response 2018-01-26 11:22:05 -05:00
Chelsea Holland Komlo
5e8151d700 refactor Fingerprint to request/response construct 2018-01-24 11:54:02 -05:00
Michael Schurter
ca38020521 0 compute == error 2017-07-03 14:51:02 -07:00
Michael Schurter
c10f530964 Fix cpu_total_compute override 2017-07-03 14:51:02 -07:00
Alex Dadgar
bfebe1afdc rename cpu_total_compute and docs 2017-03-14 14:15:49 -07:00
Alex Dadgar
36dc330737 Various fixes
This PR:
* Uses Go 1.8 executable lookup
* Stores any err message from stats init method
* Allows overriding of Cpu Compute for hosts where it can't be detected
2017-03-14 12:56:31 -07:00
Kenjiro Nakayama
5e4dbd0ff3 tiny: Fix duplicated error message in CPU fingerprint 2016-08-07 12:49:40 +09:00
Alex Dadgar
f2e28735a5 Treat float as int 2016-06-22 15:09:39 -07:00
Alex Dadgar
d87d988491 Floor CPU MHz and total compute and mark hostname as unique 2016-06-22 15:01:36 -07:00
Sean Chittenden
e26606acfd Memoize the CPU stats. Error if CPU fingerprinting fails. 2016-06-17 12:13:53 -07:00
Sean Chittenden
e42f7d5c23 Record and use only the first Mhz from the CPU fingerprinter.
Assume all cores are the same speed.
2016-06-17 11:06:57 -07:00
Sean Chittenden
b0490efb38 In the debug log, split the unit from the measurement
awk(1) friendly is UNIX(tm) friendly.
2016-06-16 23:07:13 -07:00
Sean Chittenden
e0b4f7a080 Warn when we're unable to fingerprint the CPU Mhz 2016-06-16 23:07:13 -07:00
Sean Chittenden
a6dc002415 Explicitly call cpu.Counts() to determine the CPU core count
Much safer than counting the number of InfoStat structs returned.
2016-06-16 23:07:13 -07:00
Diptanu Choudhury
445b181fec Updated gopsutil 2016-05-28 19:42:34 -07:00
Sean Chittenden
a91f41ce14 Establish a floor of one core for the number of cores.
In most cases the upstream library [shirou/gopsutil](https://github.com/shirou/gopsutil)
needs to be fixed.
2016-05-09 12:22:40 -07:00
Sean Chittenden
fd9bcabaa8 Emit various debugging information with the results of the fingerprinter 2016-05-09 12:21:51 -07:00
Alex Dadgar
5b067a3e4f Merge fix 2015-11-05 13:46:02 -08:00
Armon Dadgar
b81105bd09 Change CPU from float64 to int 2015-09-23 11:14:32 -07:00
Chris Bednarski
ca7798268e Get average frequency of all CPUs so we can do average frequency * cores for total compute 2015-08-27 13:35:54 -07:00
Clint
d05d878dfc Merge pull request #6 from hashicorp/cpu-resources
populate CPU in Node Resources
2015-08-27 15:26:00 -05:00
Chris Bednarski
9a12a00966 Merge pull request #4 from hashicorp/f-storage-fingerprint
Add storage fingerprinter
2015-08-27 12:43:18 -07:00
Chris Bednarski
6804ec7450 Changed logs to errors; added data to node.Resources.DiskMB 2015-08-27 12:23:17 -07:00
Clint Shryock
050ee19547 populate CPU in Node Resources 2015-08-27 14:15:56 -05:00
Clint Shryock
4e5dcf5c43 Add cpu.frequency, cpu.totalcompute 2015-08-27 09:19:53 -05:00