Handle the case when /sys/class/power_supply doesn't exist. Fixes
logging error spam.
Requires https://github.com/prometheus/procfs/pull/308
Signed-off-by: Ben Kochie <superq@gmail.com>
TCP "OutRsts" is the number of TCP resets sent by the node. This can be
useful for monitoring connection failures and flooding.
Signed-off-by: Ben Kochie <superq@gmail.com>
We must know the length of the various filesystem C strings before
turning them from a byte array into a Go string, otherwise our Go
strings could contain null bytes, corrupting the label values.
Signed-off-by: David O'Rourke <david.orourke@gmail.com>
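A minimal sketch of the idea, assuming the C char array has already been copied into a Go `[]byte` (on some platforms the struct fields are `int8` arrays and need converting first). `cString` is a hypothetical helper name, not necessarily what the collector uses:

```go
package main

import "bytes"

// cString converts a fixed-size, NUL-padded C char array into a Go string.
// Only the bytes before the first NUL are kept; if the array contains no
// NUL at all, the whole array is used.
func cString(b []byte) string {
	if i := bytes.IndexByte(b, 0); i >= 0 {
		return string(b[:i])
	}
	return string(b)
}
```

Converting the whole array with `string(raw[:])` instead would keep the trailing NUL bytes, which is what ends up in the label values.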
* Two new states will be added to the tcpstat collector called rx_queued_bytes and tx_queued_bytes.
For UDP datagrams an additional collector 'udp_queues' can be used to expose the total lengths of the tx_queue and rx_queue.
@SuperQ and @discordianfish, this change gives us the option to check for overloaded UDP + TCP processing.
The names of the new TCP states and the UDP metric can be discussed.
The current reasons are just:
I don't want to add another collector for the same exposed file, so I just added the new states to the tcpstat collector.
I chose the name 'udp_queues' instead of 'udpstat' as UDP has no state.
Signed-off-by: Peter Bueschel <peter.bueschel@logmein.com>
The `node_cpu_info` metric contains some information like the `model`
(which is an integer), but not the human readable model name. Also the
stepping of the processor might be interesting, since different steppings
of a processor might behave differently.
Signed-off-by: Benjamin Drung <benjamin.drung@cloud.ionos.com>
* Use `strconv.Itoa()` instead of `fmt.Sprintf()` for simple conversion.
* Eliminate copy-paste in collector setup.
Signed-off-by: Ben Kochie <superq@gmail.com>
* add a map of profilers to CPUids
`runtime.NumCPU()` returns the number of CPUs that the process can run
on. This number does not necessarily correlate to CPU ids if the
affinity mask of the process is set.
This change maintains the current behavior as default, but also allows
the user to specify a range of CPUids to use instead.
The CPU id is stored as the value of a map keyed on the profiler
object's address.
Signed-off-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Signed-off-by: Daniel Hodges <hodges@uber.com>
Co-authored-by: jdamato-fsly <55214354+jdamato-fsly@users.noreply.github.com>
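A sketch of the map-based bookkeeping described above. The `profiler` type is a placeholder for the real per-CPU profiler, and the "start-end" range syntax and helper names are assumptions for illustration, not the collector's actual flag format. The placeholder struct has a field on purpose: Go may give distinct zero-size values the same address, which would break a map keyed on pointers.

```go
package main

import (
	"fmt"
	"runtime"
	"strconv"
	"strings"
)

// profiler is a placeholder for a per-CPU perf profiler. It has a field so
// that each allocation gets a distinct address (zero-size values may share one).
type profiler struct {
	fd int // hypothetical file descriptor, unused here
}

// parseCPURange turns a hypothetical "start-end" flag value (inclusive) into
// CPU ids. An empty value keeps the old default of 0..runtime.NumCPU()-1.
func parseCPURange(s string) ([]int, error) {
	if s == "" {
		ids := make([]int, runtime.NumCPU())
		for i := range ids {
			ids[i] = i
		}
		return ids, nil
	}
	parts := strings.SplitN(s, "-", 2)
	if len(parts) != 2 {
		return nil, fmt.Errorf("invalid CPU range %q, want start-end", s)
	}
	start, err := strconv.Atoi(parts[0])
	if err != nil {
		return nil, err
	}
	end, err := strconv.Atoi(parts[1])
	if err != nil {
		return nil, err
	}
	if start < 0 || end < start {
		return nil, fmt.Errorf("invalid CPU range %q", s)
	}
	ids := make([]int, 0, end-start+1)
	for id := start; id <= end; id++ {
		ids = append(ids, id)
	}
	return ids, nil
}

// newProfilers creates one profiler per CPU id and records each profiler's
// CPU id in a map keyed on the profiler's address.
func newProfilers(cpus []int) map[*profiler]int {
	m := make(map[*profiler]int, len(cpus))
	for _, id := range cpus {
		m[&profiler{fd: -1}] = id
	}
	return m
}
```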
Many collectors depend on underlying features to be enabled. This causes
confusion about what "success" means. This changes the behavior of the
`node_scrape_collector_success` metric.
* When a collector is unable to find data don't return success.
* Catch the no data error and send to Debug log level to avoid log spam.
* Update collectors to support this new functionality.
* Fix copy-pasta mistake in InfiniBand debug message.
Closes: https://github.com/prometheus/node_exporter/issues/1323
Signed-off-by: Ben Kochie <superq@gmail.com>
Let the node exporter collect the non-numeric data from
/sys/class/infiniband: board ID, firmware version, and HCA type.
Signed-off-by: Benjamin Drung <benjamin.drung@cloud.ionos.com>
Co-authored-by: Ben Kochie <superq@gmail.com>
Reuse the Go-only implementation already in place for FreeBSD (#385) on
Darwin, DragonflyBSD, NetBSD and OpenBSD.
Tested on all affected platforms.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
This exposes RAPL statistics from /sys/class/powercap.
Co-Authored-By: Ben Kochie <superq@gmail.com>
Signed-off-by: Ukri Niemimuukko <ukri.niemimuukko@intel.com>
* Add unix socket support for supervisord collector
For example:
--collector.supervisord.url=unix:///var/run/supervisor.sock
Fixes prometheus/node_exporter#262
Signed-off-by: Paul Cameron <cameronpm@gmail.com>
Integer division and the order of operations when converting Mbps to Bps
results in a loss of accuracy if the interface speeds are set low.
For example, 100 Mbps is reported as 12000000 Bps but should be 12500000,
and 10 Mbps is reported as 1000000 Bps but should be 1250000.
Signed-off-by: Thomas Lin <t.lin@mail.utoronto.ca>
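The arithmetic can be reproduced with two hypothetical helpers. Dividing by 8 first truncates, while multiplying first keeps the result exact for every integer Mbps value, because 1,000,000 is divisible by 8:

```go
package main

// mbpsToBytesLossy divides by 8 first, so the integer division truncates
// the fractional part before scaling up.
func mbpsToBytesLossy(mbps int64) int64 {
	return mbps / 8 * 1000 * 1000
}

// mbpsToBytes scales to bits per second first and divides by 8 last.
func mbpsToBytes(mbps int64) int64 {
	return mbps * 1000 * 1000 / 8
}
```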
This will now use `bcstats.numbufpages` instead of `uvmexp.vnodepages`.
Inspired by OpenBSD's `src/usr.bin/top`
Signed-off-by: Matthieu Guegan <matthieu.guegan@deindeal.ch>
* Add diskstat flush request counters for Linux 5.5+
* Update tests for diskstat flush request counters with Linux 5.5+
Signed-off-by: Holger Hoffstätte <holger@applied-asynchrony.com>
* Add makefile target to update sysfs fixtures.
* Use similar style for fixtures from procfs.
* Re-pack fixtures ttar file.
Signed-off-by: Ben Kochie <superq@gmail.com>
Collect the InfiniBand port state, the physical state, and the maximum
signal transfer rate.
Signed-off-by: Benjamin Drung <benjamin.drung@cloud.ionos.com>
commit 5ef96388a978c54173e1b1ec8e7bcb41fc7d130d
Author: Sven Haardiek <sven@haardiek.de>
Date: Wed Sep 18 20:45:23 2019 +0200
block variables
Signed-off-by: Sven Haardiek <sven@haardiek.de>
commit c1177382e241994618a8ab7dd9842027d597b0df
Author: Sven Haardiek <sven@haardiek.de>
Date: Wed Sep 18 20:38:33 2019 +0200
Use SI Units
Signed-off-by: Sven Haardiek <sven@haardiek.de>
commit 04e4f99c423872d3094f21f89a8235b233a01941
Merge: 5417c98 f3538e1
Author: Sven Haardiek <sven@haardiek.de>
Date: Wed Sep 18 19:20:17 2019 +0200
Merge branch 'master' into power_supply_class
commit 5417c9820a40b37b490caedeaa3526883380b9bf
Author: Sven Haardiek <sven@haardiek.de>
Date: Wed Sep 4 23:02:39 2019 +0200
Drop averages
Signed-off-by: Sven Haardiek <sven@haardiek.de>
commit 1f1447dbe7bbdcdabebf4c968beb14c67d89dd9f
Author: Sven Haardiek <sven@haardiek.de>
Date: Wed Sep 4 22:56:00 2019 +0200
Update Copyright
Signed-off-by: Sven Haardiek <sven@haardiek.de>
commit 9677425059a3bf61cd7498cf7b5f05d5af7a626b
Merge: 0b51589 d3478a2
Author: Sven Haardiek <sven@haardiek.de>
Date: Mon Sep 2 22:02:53 2019 +0200
Merge branch 'master' into power_supply_class
commit 0b51589f390cc1b33ea4728d85fca3a3b231cf3f
Author: PrometheusBot <prometheus-team@googlegroups.com>
Date: Fri Aug 30 13:32:17 2019 +0200
makefile: update Makefile.common with newer version (#1466)
Signed-off-by: prombot <prometheus-team@googlegroups.com>
commit af2b9e849c7b69237b7fa0e9a289c929ec7173a0
Author: Boris Momčilović <boris.momcilovic@gmail.com>
Date: Tue Aug 27 14:24:11 2019 +0200
Ipvs firewall mark (#1455)
* IPVS: include firewall mark label
Signed-off-by: Boris Momčilović <boris@firstbeatmedia.com>
commit 773f99de7f699900a00b4d35340e356fe7098ee7
Author: Paul Gier <pgier@redhat.com>
Date: Tue Aug 27 02:26:19 2019 -0500
update procfs to v0.0.4 (#1457)
Signed-off-by: Paul Gier <pgier@redhat.com>
commit 6f8a4f4348f62700cbf7eeb2657851237e13c35d
Author: beorn7 <beorn@grafana.com>
Date: Tue Aug 20 18:49:12 2019 +0200
Update legendLink
This still had the 'k8s' in as it was copied and pasted from the
kubernetes-mixin.
Signed-off-by: beorn7 <beorn@grafana.com>
commit d758cf394cfbed9e87e116a24d72050066cd039a
Author: beorn7 <beorn@grafana.com>
Date: Wed Aug 14 22:24:24 2019 +0200
Make the severity of "critical" alerts configurable
This addresses the blissful scenario where single-node failures are
unproblematic. No reason to wake somebody up if a node is about to
screw itself up by filling the disk.
Signed-off-by: beorn7 <beorn@grafana.com>
commit 041b9e1e785f5f43bbef97c0c76d205181d08890
Author: beorn7 <beorn@grafana.com>
Date: Thu Aug 15 16:43:57 2019 +0200
Add line for number of cores to load graph
Backported from the node dashboard in the kubernetes-mixin.
Signed-off-by: beorn7 <beorn@grafana.com>
commit 5552bb3a6b2be1e3dd1a93dbdb9650bd0363a922
Author: beorn7 <beorn@grafana.com>
Date: Thu Aug 15 16:36:10 2019 +0200
Fix title of CPU panel to usage
We use the `mode="idle"` metric, but we are inverting it, so this is
usage, and that's intended.
Signed-off-by: beorn7 <beorn@grafana.com>
commit db0571b402233323ed7e222e53f7ef7738520f49
Author: beorn7 <beorn@grafana.com>
Date: Thu Aug 15 16:32:54 2019 +0200
node-mixin: Improve disk usage panel
- Use a stacked graph instead of a gauge as development over time is
especially useful for disk space usage.
- By only taking one metric per device into account, we avoid
double-counting for devices that are mounted multiple times.
Signed-off-by: beorn7 <beorn@grafana.com>
commit 3822e096c5d27d06b9c9a68beff81ef23f12eb36
Author: Björn Rabenstein <beorn@grafana.com>
Date: Thu Aug 15 00:40:51 2019 +0200
node-mixin: Improve nodes dashboard (#1448)
* node-mixin: Improve nodes dashboard
- Use stacking where it makes sense.
- Normalize idle CPU so that stacking is more meaningful.
- Consistently fill where stacking is used but don't fill where not.
- Fix y axis max value for Idle CPU panel.
- Fix y axis min value for memory usage panel.
- Use `$__interval` for range where applicable (and set min step
to 1m).
- Make the right Y axis for disk I/O actually work.
These are just incremental improvements. They don't touch the more
involved TODOs.
Signed-off-by: beorn7 <beorn@grafana.com>
commit fbced86b9835e1b196c15ddcac01ba3cfcf369cc
Author: beorn7 <beorn@grafana.com>
Date: Tue Aug 13 21:54:28 2019 +0200
node-mixin: Fix various straightforward issues in the USE dashboards
- Normalize cluster memory utilisation.
- Fix missing `1m` in memory saturation.
- Have both disk-related rows next to each other instead of having the
network row in between.
- Correctly render transmit network traffic as negative, using
`seriesOverrides` and `min: null` for the y-axis.
- Make panel and row naming consistent.
- Remove legend where it would just display a single entry with
exactly the title of the panel.
- Fix metric name in individual node CPU Saturation panel.
- Break up disk space utilisation by device in the panel for an
individual node.
NB: None of this touches the subtler issues captured in the
various TODOs.
Signed-off-by: beorn7 <beorn@grafana.com>
commit 5bdf0625023cf7d05e0f65c6b6a1303637772ca6
Author: Sandro Jäckel <sandro.jaeckel@gmail.com>
Date: Wed Aug 7 09:19:20 2019 +0200
Update rootfs syntax in Docker example (#1443)
Signed-off-by: Sandro Jäckel <sandro.jaeckel@gmail.com>
commit b59f081d45a3ca65957900ec33772dca25a3066f
Author: Phil Frost <phil@postmates.com>
Date: Tue Aug 6 13:08:06 2019 -0400
Fix seconds reported by schedstat (#1426)
Upstream bugfix: https://github.com/prometheus/procfs/pull/191
Signed-off-by: Phil Frost <phil@postmates.com>
commit ac9a059ae81fa31f9963614483af3b5e3bfd672c
Author: Sven Haardiek <sven@haardiek.de>
Date: Sun Aug 4 20:15:36 2019 +0200
Try to make it work for PowerPC
Signed-off-by: Sven Haardiek <sven@haardiek.de>
commit c81acf3b009e8538783489d1468f33faf65d8b01
Merge: c064116 75462bf
Author: Sven Haardiek <sven@haardiek.de>
Date: Sun Aug 4 20:14:16 2019 +0200
Merge remote-tracking branch 'upstream/master' into power_supply_class
Signed-off-by: Sven Haardiek <sven@haardiek.de>
commit c0641162c3a432f29df30c8d0632a7756d7d2bff
Merge: 06f6e3e 0b710bb
Author: Sven Haardiek <sven@haardiek.de>
Date: Fri Aug 2 18:30:28 2019 +0200
Merge branch 'master' into power_supply_class
Signed-off-by: Sven Haardiek <sven@haardiek.de>
commit 06f6e3e8b2a9b2e3f345b6d312a777731bb4b403
Author: Sven Haardiek <sven.haardiek@iotec-gmbh.de>
Date: Fri Mar 22 15:36:03 2019 +0100
Fix Pull Request comments
* concise metric conditions
* combine info about power supply to one metric
Signed-off-by: Sven Haardiek <sven.haardiek@iotec-gmbh.de>
commit 785c3735c4626de56f8341f800ab7bb5e2594d08
Author: Sven Haardiek <sven@haardiek.de>
Date: Sat Mar 9 18:47:52 2019 +0100
Use sys.ttar instead of uploading the files
Signed-off-by: Sven Haardiek <sven@haardiek.de>
commit e07bff5d938457147b9009aef7d42d763018cd66
Author: Sven Haardiek <sven@haardiek.de>
Date: Sat Mar 9 18:34:50 2019 +0100
Add information from /sys/class/power_supply
Signed-off-by: Sven Haardiek <sven@haardiek.de>
commit 55b3e34840c9dfc6513ae8e69b6479d5842a3091
Author: Sven Haardiek <sven@haardiek.de>
Date: Sat Mar 9 18:09:45 2019 +0100
Use cyclecount instead of cycle_count since it is a gauge
Signed-off-by: Sven Haardiek <sven@haardiek.de>
commit 602350b333cf9353d2cd0ffd40206c96ffe29941
Author: Sven Haardiek <sven@haardiek.de>
Date: Sat Mar 9 18:09:25 2019 +0100
other build options
Signed-off-by: Sven Haardiek <sven@haardiek.de>
commit 5aa38f678451d5b63ffdc32336345a1ff6703725
Author: Sven Haardiek <sven@haardiek.de>
Date: Sat Mar 9 18:08:56 2019 +0100
Update fixtures
Signed-off-by: Sven Haardiek <sven@haardiek.de>
commit c6acc474a4224b8d9f7b178d0d2e02636d8629ea
Author: Sven Haardiek <sven@haardiek.de>
Date: Sat Mar 9 17:20:30 2019 +0100
Update command line parameter flag
Signed-off-by: Sven Haardiek <sven@haardiek.de>
commit f5a329e6ae5ed3b16aa866d67b944f1a73edfe42
Author: Sven Haardiek <sven@haardiek.de>
Date: Sat Mar 9 17:20:06 2019 +0100
Update procfs dependency
Signed-off-by: Sven Haardiek <sven@haardiek.de>
commit 38d5fa5165643d6a44dc863b3a1696774259ac0d
Merge: 5a7ce69 28f3582
Author: Sven Haardiek <sven@haardiek.de>
Date: Sat Mar 9 16:28:29 2019 +0100
Merge branch 'power_supply_class' of github.com:shaardie/node_exporter into power_supply_class
commit 5a7ce69505079c9c090e44448cfbd7ffb2b04df7
Author: Sven Haardiek <sven@haardiek.de>
Date: Sat Oct 20 18:55:49 2018 +0200
Updated Metrics of Power Supply Class
Signed-off-by: Sven Haardiek <sven@haardiek.de>
commit 690ab1b9c1f2e183b7088cf81c7f266d85ee6df6
Author: Sven Haardiek <sven@haardiek.de>
Date: Fri Oct 19 20:03:42 2018 +0200
Start work on Power Supply Collector
Signed-off-by: Sven Haardiek <sven@haardiek.de>
commit 28f358222bbac4315fbf44d94da36d4b0ff2ed55
Author: Sven Haardiek <sven@haardiek.de>
Date: Sat Oct 20 18:55:49 2018 +0200
Updated Metrics of Power Supply Class
Signed-off-by: Sven Haardiek <sven@haardiek.de>
commit 751d99b818503e9a4430b10c39760f180349b294
Author: Sven Haardiek <sven@haardiek.de>
Date: Fri Oct 19 20:03:42 2018 +0200
Start work on Power Supply Collector
Signed-off-by: Sven Haardiek <sven@haardiek.de>
Signed-off-by: Sven Haardiek <sven@haardiek.de>
Parsing the sysfs files for InfiniBand was added to the procfs library
(see https://github.com/prometheus/procfs/pull/164).
Therefore use `InfiniBandClass` from the procfs library instead of
parsing sysfs itself.
If a port counter returns `N/A (no PMA)`, no metric will be returned
(instead of returning 0 for this metric).
Signed-off-by: Benjamin Drung <benjamin.drung@cloud.ionos.com>
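A sketch of the parsing rule, assuming the raw counter file contents are passed in as a string; `parsePortCounter` is a hypothetical name, and returning a nil pointer is one way to signal "no metric":

```go
package main

import (
	"strconv"
	"strings"
)

// parsePortCounter parses the contents of an InfiniBand port counter file.
// Counters the hardware cannot report contain "N/A (no PMA)"; for those it
// returns nil so the caller can skip the metric instead of exporting 0.
func parsePortCounter(raw string) (*uint64, error) {
	s := strings.TrimSpace(raw)
	if s == "N/A (no PMA)" {
		return nil, nil
	}
	v, err := strconv.ParseUint(s, 10, 64)
	if err != nil {
		return nil, err
	}
	return &v, nil
}
```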
The dbus property 'SystemState' and the timer property 'LastTriggerUSec'
were added in version 212 of systemd.
Check that the version of systemd is at least 212 before attempting
to query these properties.
f755e3b74bdedabea4b3
Resolves issue #291
Signed-off-by: Paul Gier <pgier@redhat.com>
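A sketch of such a version gate. It assumes the version string reported by systemd starts with the major version number (for example "239" or "245.4-4ubuntu3"); the helper names are hypothetical:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// minSystemdVersionSystemState is the first systemd release exposing the
// SystemState and LastTriggerUSec properties.
const minSystemdVersionSystemState = 212

// parseSystemdVersion reads the leading decimal digits of a version string
// such as "239" or "245.4-4ubuntu3".
func parseSystemdVersion(v string) (int, error) {
	v = strings.TrimSpace(v)
	end := 0
	for end < len(v) && v[end] >= '0' && v[end] <= '9' {
		end++
	}
	if end == 0 {
		return 0, fmt.Errorf("cannot parse systemd version %q", v)
	}
	return strconv.Atoi(v[:end])
}

// supportsSystemState reports whether the newer properties can be queried.
func supportsSystemState(version int) bool {
	return version >= minSystemdVersionSystemState
}
```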
Use the extra information gleaned from the mountinfo file to add
a 'mountaddr' field for NFS metrics. This helps prevent prometheus from
ignoring mounts that come from the same URL, but are actually from
different IP addresses.
This commit also rebases to current master
Signed-off-by: Dipack P Panjabi <dpanjabi@hudson-trading.com>
As per prometheus/client_golang#543, pass the Registry for exporter
metrics when setting up the /metrics HTTP handler.
With this, the `promhttp_metric_handler_errors_total` metric will
increment on (possibly non-fatal) collection-time errors, such as
duplicate metrics from text files.
Signed-off-by: Matthias Rampke <mr@soundcloud.com>
Include directory operation, read/write system call, and vnode runtime
statistics for XFS filesystems.
Signed-off-by: Steven Kreuzer <skreuzer@FreeBSD.org>
Based on the Solaris implementation. There are a lot of other sysctls
available on FreeBSD that aren't reported here. They'll be easy to add
if they're useful. All of the sysctls are uint64.
Signed-off-by: Derek Marcotte <554b8425@razorfever.net>
* Closes issue #261 on node_exporter.
Delegated mdstat parsing to procfs project. mdadm_linux.go now only exports the metrics.
-> Added disk labels: "fail", "spare", "active" to indicate disk status
-> Changed metric node_md_disks_total ==> node_md_disks_required
-> Removed test cases for mdadm_linux.go, as the functionality they tested for has been moved to procfs project.
Signed-off-by: Advait Bhatwadekar <advait123@ymail.com>
The filesystem is read-only and is often used for a virtual FS
with a configuration file for a virtual machine.
Signed-off-by: Leonid Evdokimov <leon@darkk.net.ru>
* Update procfs vendor to pull in github.com/prometheus/procfs/pull/165
* Update mountstats collector to use new types.
* Rollover counter automatically to avoid float64 accuracy issues.
* Update e2e test.
Signed-off-by: Ben Kochie <superq@gmail.com>
* Add --collector.netdev.device-whitelist flag
Sometimes it is desired to monitor only one netdev. The golang regexp
does not support a negated regex, so the ignored-devices flag is too
cumbersome for this task.
This change introduces a new flag, device-whitelist, which is mutually
exclusive with ignored-devices. This flag allows specifying ONLY the
netdev you'd like.
Signed-off-by: Noam Meltzer <noam@cynerio.co>
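A sketch of mutually exclusive ignore/accept patterns using Go's `regexp` package. RE2 syntax has no negative lookahead, so an accept pattern such as `^eth0$` is far simpler than an ignore pattern that matches every other device. The type and function names are hypothetical:

```go
package main

import (
	"errors"
	"regexp"
)

// deviceFilter holds either an ignore pattern or an accept pattern.
type deviceFilter struct {
	ignore *regexp.Regexp
	accept *regexp.Regexp
}

// newDeviceFilter compiles at most one of the two patterns and rejects
// configurations that set both.
func newDeviceFilter(ignorePattern, acceptPattern string) (deviceFilter, error) {
	var f deviceFilter
	if ignorePattern != "" && acceptPattern != "" {
		return f, errors.New("ignore and accept patterns are mutually exclusive")
	}
	var err error
	if ignorePattern != "" {
		f.ignore, err = regexp.Compile(ignorePattern)
	} else if acceptPattern != "" {
		f.accept, err = regexp.Compile(acceptPattern)
	}
	return f, err
}

// ignored reports whether a device should be skipped.
func (f deviceFilter) ignored(name string) bool {
	if f.accept != nil {
		return !f.accept.MatchString(name)
	}
	if f.ignore != nil {
		return f.ignore.MatchString(name)
	}
	return false
}
```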
According to the golang docs, the syscall package is deprecated.
https://golang.org/pkg/syscall
This updates collectors to use the x/sys/unix package instead.
Also updates the vendored x/sys/unix module to latest.
Signed-off-by: Paul Gier <pgier@redhat.com>
Previously, the node_textfile_mtime_seconds metric was based on the
FileInfo.ModTime() of the ioutil.ReadDir() return value. This is based
on lstat() and therefore has unintended consequences for symlinks
(modification time of the symlink instead of the symlink target is
returned). It is also racy as the lstat() is performed before reading
the file.
This commit changes the node_textfile_mtime_seconds metric to be based
on a fresh Stat() call on the open file. This eliminates the race and
works as expected for symlinks. Fixes #1324.
Signed-off-by: Christian Hoffmann <mail@hoffmann-christian.info>
This enables the collection of pressure stall information as exposed
by the `/proc/pressure` interface added in the 4.20 release of the
Linux kernel.
Closes #1174
Signed-off-by: Daniele Sluijters <daenney@users.noreply.github.com>
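Each file under `/proc/pressure` contains lines such as `some avg10=0.00 avg60=0.00 avg300=0.00 total=0`. A hand-rolled parser for one line might look like the sketch below; the collector may well delegate this to the procfs library instead, and `psiLine` and `parsePSILine` are hypothetical names:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// psiLine holds one line of a /proc/pressure/{cpu,io,memory} file.
type psiLine struct {
	Kind   string  // "some" or "full"
	Avg10  float64 // stall percentage averaged over 10s
	Avg60  float64 // ... over 60s
	Avg300 float64 // ... over 300s
	Total  uint64  // cumulative stall time in microseconds
}

func parsePSILine(line string) (psiLine, error) {
	fields := strings.Fields(line)
	if len(fields) != 5 {
		return psiLine{}, fmt.Errorf("unexpected PSI line %q", line)
	}
	p := psiLine{Kind: fields[0]}
	for _, kv := range fields[1:] {
		parts := strings.SplitN(kv, "=", 2)
		if len(parts) != 2 {
			return psiLine{}, fmt.Errorf("malformed field %q", kv)
		}
		var err error
		switch parts[0] {
		case "avg10":
			p.Avg10, err = strconv.ParseFloat(parts[1], 64)
		case "avg60":
			p.Avg60, err = strconv.ParseFloat(parts[1], 64)
		case "avg300":
			p.Avg300, err = strconv.ParseFloat(parts[1], 64)
		case "total":
			p.Total, err = strconv.ParseUint(parts[1], 10, 64)
		default:
			err = fmt.Errorf("unknown key %q", parts[0])
		}
		if err != nil {
			return psiLine{}, err
		}
	}
	return p, nil
}
```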
Minor change to match naming convention in other collectors.
Initialize the proc or sys FS instance once while initializing
each collector instead of re-creating for each metric update.
Signed-off-by: Paul Gier <pgier@redhat.com>
The cpu frequency information is not always needed and/or available.
This change allows the cpu frequency metrics to be enabled/disabled
separately from the other cpu metrics, and also prevents a frequency
metric failure (such as a parse error) from failing the main cpu
collector.
Fixes #1241
Signed-off-by: Paul Gier <pgier@redhat.com>
This reduces the systemd metric collection time by using a wait group
and goroutines to allow the systemd metric calls to happen concurrently.
It also disables the start time, restarts, tasks_max, and tasks_current metrics by default
because these can be time-consuming to gather.
Signed-off-by: Paul Gier <pgier@redhat.com>
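A generic sketch of the wait-group pattern, assuming each metric source can be modelled as a function returning a float64 (node_exporter collectors emit metrics on a channel instead). A mutex guards the shared result map:

```go
package main

import "sync"

// collectConcurrently runs each metric-gathering function in its own
// goroutine and waits for all of them, so total collection time is roughly
// that of the slowest call rather than the sum of all calls.
func collectConcurrently(fns map[string]func() float64) map[string]float64 {
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex
		results = make(map[string]float64, len(fns))
	)
	for name, fn := range fns {
		wg.Add(1)
		go func(name string, fn func() float64) {
			defer wg.Done()
			v := fn()
			mu.Lock()
			results[name] = v
			mu.Unlock()
		}(name, fn)
	}
	wg.Wait()
	return results
}
```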
With a bond interface the state of the slave interface from the bond's
point of view is reflected in `mii_status` and is independent of the
link's `operstate`.
When a bond is monitored with `miimon`, `mii_status` will reflect the
state of the physical link as configured via the operator.
When a bond is monitored via `arp_interval` the `mii_status` will
reflect the results of the bond ARP checking. This means the link can
be down from the bond's point of view, but up from a physical
connection point of view.
If a bond is not monitored via miimon or arp, the `mii_status` should
likely be always `up`, however I have observed a case where this is not
true and the `operstate` is `up` while `mii_status` is `down`. However,
the kernel bonding documentation stresses that a bond should not be
configured without one of `miimon` or `arp_interval`.
This change results in the metric 'node_bonding_active' matching the
up/down state of the bond's point of view rather than operstate.
Signed-off-by: Sachi King <nakato@nakato.io>
* netclass_linux: remove varying labels from the 'up' metric
This moves the variable label values such as 'operstate' out of
the 'network_up' metric and into a separate metric called '_info'.
This allows the 'up' metric to remain continuous over state changes.
Fixes #1236
Signed-off-by: Paul Gier <pgier@redhat.com>
* Rename interface to device in netclass collector
This makes it consistent with other networking metrics like node_network_receive_bytes_total
This closes #1223
Signed-off-by: Johannes 'fish' Ziemke <github@freigeist.org>
* Add diskstats collector for OpenBSD
Tested on i386 and amd64, OpenBSD 6.4 and -current.
* Refactor diskstats collectors
This moves common descriptors from Linux, Darwin, OpenBSD
diskstats collectors into diskstats_common.go
Signed-off-by: Ralf Horstmann <ralf+github@ackstorm.de>
Similar to #1228. Update the remaining collectors to use
'path/filepath' instead of 'path' for manipulating file paths.
Signed-off-by: Paul Gier <pgier@redhat.com>
Adds a new label called "type" to systemd_unit_state, which contains the
Type field from the unit file. This applies only to the .service and
.mount unit types. The other unit types do not include the optional
type field.
Fixes #1210
Signed-off-by: Paul Gier <pgier@redhat.com>
* netstat: Add TCP In/Out Segs
In order to get a better idea of TCP packet loss, we need to know how
many `node_netstat_Tcp_OutSegs` there are so we can compare this to
`node_netstat_Tcp_RetransSegs`.
Signed-off-by: Ben Kochie <superq@gmail.com>
* Update fixtures
Signed-off-by: Ben Kochie <superq@gmail.com>
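With both counters available, the retransmission ratio over an interval is the ratio of the two counter deltas. A sketch, assuming two samples of each cumulative counter and ignoring counter resets; `retransRatio` is a hypothetical name:

```go
package main

// retransRatio returns the fraction of sent TCP segments that were
// retransmissions between two samples of the cumulative Tcp OutSegs and
// RetransSegs counters. Counter resets are not handled.
func retransRatio(outSegs0, retrans0, outSegs1, retrans1 uint64) float64 {
	sent := outSegs1 - outSegs0
	if sent == 0 {
		return 0
	}
	return float64(retrans1-retrans0) / float64(sent)
}
```

In Prometheus the same ratio is usually computed in PromQL, for example `rate(node_netstat_Tcp_RetransSegs[5m]) / rate(node_netstat_Tcp_OutSegs[5m])`.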
* Add fallback for missing /proc/1/mounts
On some systems, `/proc/1/mounts` is hidden from non-root users due to
the `hidepid` procfs feature. Attempt to fall back to `/proc/mounts` if
`/proc/1/mounts` is not found.
Signed-off-by: Ben Kochie <superq@gmail.com>
* Add tests.
Signed-off-by: Ben Kochie <superq@gmail.com>
* Add CHANGELOG entry.
Signed-off-by: Ben Kochie <superq@gmail.com>
The pull request #1002 changed the logic used on Linux servers to determine if a filesystem is
read-only. As a result of this change, the variable `readOnly` is now unused and can be removed.
Signed-off-by: Jerome Froelich <jeromefroelich@hotmail.com>
TCP SYN packet retransmits are a very useful signal as they affect
network performance disproportionately to regular TCP retransmits.
Signed-off-by: Ben Kochie <superq@gmail.com>
In some cases the file might be called "temp" instead of the usual format "temp<index>_<item>"
as described in the kernel docs: https://www.kernel.org/doc/Documentation/hwmon/sysfs-interface
In this case, treat this as an _input file containing the current temperature reading.
Fixes #1122
Signed-off-by: Paul Gier <pgier@redhat.com>
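A sketch of splitting hwmon attribute names, with the bare "temp" file mapped to an `input` reading. The regular expression and helper name are assumptions and only cover names of the form `<type><index>_<item>`:

```go
package main

import "regexp"

// hwmonName matches attribute names such as "temp1_input" or
// "temp2_crit_alarm": a lowercase type, a numeric index, then the item.
var hwmonName = regexp.MustCompile(`^([a-z]+)([0-9]+)_([a-z_]+)$`)

// explodeHwmonName splits a hwmon file name into type, index and item.
// A bare "temp" file is treated as the current temperature reading.
func explodeHwmonName(name string) (typ, index, item string, ok bool) {
	if name == "temp" {
		return "temp", "", "input", true
	}
	m := hwmonName.FindStringSubmatch(name)
	if m == nil {
		return "", "", "", false
	}
	return m[1], m[2], m[3], true
}
```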
* Update Linux cpufreq collector to use new procfs library functions.
* Split thermal throttle collection to a separate function.
* Add new required fixtures and repack ttar file.
Signed-off-by: Ben Kochie <superq@gmail.com>
The format of /proc/diskstats is changing in linux-4.19 to include some
additional fields. See: https://www.kernel.org/doc/Documentation/iostats.txt
* collector/diskstats: use constants for some hard coded strings
* collector/diskstats: update diskstats for linux-4.19
* collector/diskstats: remove kernel doc url from individual metrics
Signed-off-by: Paul Gier <pgier@redhat.com>
* infiniband: Add not connected i40iw0/ports/1 fixtures
* infiniband: Handle issue when iWARP* RDMA modules are not available
This is related to #966 and handles this error:
Jun 07 13:33:24 hostname node_exporter[81888]: time="2018-06-07T13:33:24+02:00" level=error msg="ERROR: infiniband
collector failed after 0.000929s: strconv.ParseUint: parsing \"N/A (no PMA)\": invalid syntax" source="collector.go:132"
Signed-off-by: Mario Trangoni <mjtrangoni@gmail.com>
* Strip rootfs prefix when running in Docker
* Use `/` as default value of path.rootfs, and parse mounts from `/proc/1/mounts`.
* No need to mount `/proc` and `/sys` because we share host's PID
namespace, which allows processes within the container to see all of the
processes on the system.
Closes: #66
Signed-off-by: Ivan Mikheykin <ivan.mikheykin@flant.com>
Signed-off-by: Yecheng Fu <cofyc.jackson@gmail.com>
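A sketch of stripping the rootfs prefix from mount points, assuming the host filesystem is bind-mounted at a path such as `/host` inside the container; `rootfsStripPrefix` is a hypothetical name. The check for a `/` boundary keeps `/hostname` from being mistaken for a path under `/host`:

```go
package main

import "strings"

// rootfsStripPrefix maps a mount point seen inside the container back to
// the host path. With the default rootfs "/" it returns the path unchanged.
func rootfsStripPrefix(rootfs, mountPoint string) string {
	if rootfs == "/" {
		return mountPoint
	}
	rootfs = strings.TrimSuffix(rootfs, "/")
	if mountPoint == rootfs {
		return "/"
	}
	if strings.HasPrefix(mountPoint, rootfs+"/") {
		return strings.TrimPrefix(mountPoint, rootfs)
	}
	return mountPoint
}
```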
Starting with (not yet released) OpenBSD 6.4, sysctl KERN_CPTIME2 will
return ENODEV for offline CPUs.
SMT siblings are reported as offline when hw.smt is disabled, which is
the default since one of the later Spectre variants. So this might
affect a few systems.
For more details see:
https://cvsweb.openbsd.org/src/sys/kern/kern_sysctl.c#rev1.348
Signed-off-by: Ralf Horstmann <ralf+github@ackstorm.de>
When starting Docker containers a whole bunch of netns (network
namespace) mounts are created that the node exporter can't make any
sense of (and can't read either).
This ignores all nsfs filesystems.
Fixes #875
Signed-off-by: Daniele Sluijters <daenney@users.noreply.github.com>
* Change systemd unit filtering
Get all units from systemd and filter in Go.
* Improves compatibility with older versions of systemd.
* Improve debugging by printing when units pass the filter.
* Remove extraneous newlines from log messages.
Signed-off-by: Ben Kochie <superq@gmail.com>
This removes the cgo import from timex collector, as it was only used
to define two constants. Those are part of the Linux kernel<->userspace
interface, thus there is no need to depend on libc to source them:
https://github.com/torvalds/linux/blob/v4.18/include/uapi/linux/timex.h
Signed-off-by: Luca Bruno <luca.bruno@coreos.com>
* Correctly cast Darwin memory info
* Cast stats to float64 before doing math on them to avoid integer
wrapping.
* Remove invalid `_total` suffix from gauge values.
* Handle counters in `meminfo.go`.
Signed-off-by: Ben Kochie <superq@gmail.com>
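The wrapping problem can be shown with 32-bit operands. The helpers below are hypothetical and assume page counts and page size arrive as uint32 values:

```go
package main

// bytesWrapping multiplies in uint32; the product silently wraps modulo
// 2^32 once the true value exceeds about 4 GiB.
func bytesWrapping(pages, pageSize uint32) uint32 {
	return pages * pageSize
}

// bytesFloat converts both operands to float64 before multiplying, so
// large page counts are represented correctly.
func bytesFloat(pages, pageSize uint32) float64 {
	return float64(pages) * float64(pageSize)
}
```

For 2,000,000 pages of 4096 bytes the true size is 8,192,000,000 bytes, while the uint32 product wraps to 3,897,032,704.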
Fix typo in unit description of metric `*read_time_seconds_total` from milliseconds to seconds.
Signed-off-by: Marco Tulio R Braga <marco.tulio@mtulio.eng.br>
* If NRestarts or NRefused are not available, don't ignore the unit itself
* Don't report systemd metrics (NRestarts/NRefused) that are not available
Signed-off-by: James Hartig <james@getadmiral.com>
PIDs can vanish (exit) from /proc/ between gathering the list of PIDs
and getting all of their stats.
* Ignore file not found errors.
* Explicitly count the PIDs we find.
* Cleanup some error style issues.
Signed-off-by: Ben Kochie <superq@gmail.com>
* Replace supervisord xmlrpc library
* Use `github.com/mattn/go-xmlrpc` that doesn't leak goroutines.
* Fix uptime metric
* Use Prometheus best practices for uptime metric.
* Use "start time" rather than "uptime".
* Don't emit a start time if the process is down.
* Add changelog entry.
* Add example compatibility rules.
Signed-off-by: Ben Kochie <superq@gmail.com>
* vendor: Update prometheus/procfs
Signed-off-by: Hannes Körber <hannes.koerber@haktec.de>
* mountstats: Use new NFS protocol field
In https://github.com/prometheus/procfs/pull/100, the NFSTransportStats
struct was expanded by a field called protocol that specifies the NFS
protocol in use, either "tcp" or "udp". This commit adds the protocol as
a label to all NFS metrics exported via the mountstats collector.
Signed-off-by: Hannes Körber <hannes.koerber@haktec.de>
* Update fixtures for UDP mount
Signed-off-by: Hannes Körber <hannes.koerber@haktec.de>
It is quite common to put /var/lib/docker itself on a separate partition
and that should be monitored as well.
Signed-off-by: Johannes Wienke <languitar@semipol.de>
While the statfs(2) approach is reliable for normally mounted filesystems, the
flags returned can be inconsistent when a filesystem has been remounted read-only
after encountering an error. The returned flags do accurately represent the
internal state of the filesystem, but they do not reflect whether the VFS layer
will accept writes. Instead, it makes sense to parse the current VFS mount
state from the options field in /proc/mounts since it takes precedence.
Signed-off-by: Brandon Gilmore <bgilmore@valvesoftware.com>
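A sketch of reading the VFS state from a `/proc/mounts` line, assuming the standard six-field format where the fourth field holds comma-separated options. It matches the `ro` option exactly, so an option such as `errors=remount-ro` is not mistaken for a read-only flag; `isReadOnly` is a hypothetical name:

```go
package main

import (
	"fmt"
	"strings"
)

// isReadOnly reports whether a /proc/mounts line describes a read-only
// mount, based on the VFS options in the fourth field.
func isReadOnly(mountsLine string) (bool, error) {
	fields := strings.Fields(mountsLine)
	if len(fields) < 4 {
		return false, fmt.Errorf("malformed mounts line %q", mountsLine)
	}
	for _, opt := range strings.Split(fields[3], ",") {
		if opt == "ro" {
			return true, nil
		}
	}
	return false, nil
}
```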
* add sys/class/net parsing from procfs and expose its metrics
Signed-off-by: Jan Klat <jenik@klatys.cz>
* change code to use int pointers per procfs change, move netclass to separate collector, change metric naming
Signed-off-by: Jan Klat <jenik@klatys.cz>
* bump year in licence, remove redundant newline, correct fixtures
Signed-off-by: Jan Klat <jenik@klatys.cz>
* fix style
Signed-off-by: Jan Klat <jenik@klatys.cz>
* change carrier changes to counter type
Signed-off-by: Jan Klat <jenik@klatys.cz>
* fix e2e output
Signed-off-by: Jan Klat <jenik@klatys.cz>
* add fixtures
Signed-off-by: Jan Klat <jenik@klatys.cz>
* update vendor, use fixtures correctly
Signed-off-by: Jan Klat <jenik@klatys.cz>
* change fixtures (device in /sys/class/net should be symlinked)
Signed-off-by: Jan Klat <jenik@klatys.cz>
* correct fixtures for 64k page, updated readme
Signed-off-by: Jan Klat <jenik@klatys.cz>
Fixed spelling mistakes.
Update transport_generic.go
Changed to a mutex approach instead of channels and added a timeout before declaring a mount stuck.
Removed unnecessary lock channel and clarified some var names.
Fixed style nits.
Signed-off-by: Mark Knapp <mknapp@hudson-trading.com>
* Add support for NRestarts counter introduced in systemd 235
`.service` units increment this counter any time the Restart= condition is
triggered.
Signed-off-by: Matthew McGinn <mamcgi@gmail.com>
* Send "Personality unknown" to debug, not info, remove unnecessary newline.
* Add support for "linear" personality.
* Always set number of active disks to 0 when a device is inactive.
* Add total disks calculation to unknown personalities.
Signed-off-by: Ben Kochie <superq@gmail.com>
* Fix for #945, cpu temperature is signed.
Added a type conversion to cpu temperature sysctl. Will still
collect/report -1 when the value is -1, this is because it should be up
to interpretation whether this is the correct value for the system or
not.
Some drivers will report -1 for cpu temperature. Other sensors will
report "an input into the fan control algorithm", i.e. not the actual
temperature, but how much fan it wants. Some people cool their machines
with liquid nitrogen.
Signed-off-by: Derek Marcotte <554b8425@razorfever.net>
* Do not rely on AArch64 CPUs to support 32-bit ARM for cross-testing.
Signed-off-by: Alexey Kopytov <akopytov@gmail.com>
* aarch64 like ppc64le reports 64k node_sockstat_TCP_mem_bytes due to 64k pages.
Signed-off-by: Alexey Kopytov <akopytov@gmail.com>
* cpu: Add a 2nd label 'package' to metric node_cpu_core_throttles_total
This commit fixes the node_cpu_core_throttles_total metrics on
multi-socket systems as the core_ids are the same for each package.
I.e. we need to count them separately.
Rename the node_package_throttles_total metric label `node` to `package`.
Reorganize the sys.ttar archive and use the same symlinks as the Linux
kernel. Also, the new fixtures now use a dual-socket dual-core cpu w/o
HT/SMT (node0: cpu0+1, node1: cpu2+3) as well as processor-less
(memory-only) NUMA node 'node2' (this is a very rare case).
Signed-off-by: Karsten Weiss <knweiss@gmail.com>
* cpu: Use the direct /sys path to the cpu files.
Use the direct path /sys/devices/system/cpu/cpu[0-9]* (without symlinks)
instead of /sys/bus/cpu/devices/cpu[0-9]*.
The latter path also does not exist e.g. on RHEL 6.9's kernel.
Signed-off-by: Karsten Weiss <knweiss@gmail.com>
* cpu: Reverse core+package throttle processing order
Signed-off-by: Karsten Weiss <knweiss@gmail.com>
* cpu: Add documentation URLs
Signed-off-by: Karsten Weiss <knweiss@gmail.com>
Netstat is 40% of the metrics on my laptop, many of which
are highly detailed information about IP internals in the kernel.
~300 such metrics on every machine in your fleet is excessive,
so focus on key metrics by default, overridable by the user.
Fixes #515
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
Vmstat has over 100 fields, most of which are highly
detailed debug information. Trim this down to only
essential fields by default, configurable by flag.
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Only report core throttles per core, not per cpu
* Add topology/core_id to the cpu sysfs fixtures
* Add new cpu fixtures to ttar file
* Merge core_id reading and thermal throttle accounting
* Declare core_id
* updates for zfsonlinux 0.7.5
* add constants for KSTAT_DATA_* types
* added e2e test for negative values represented by uint64 that can result from ZFS bugs
Enable NFS client metrics by default now that the collector no longer prints
errors on scrape if there are no metrics to display.
Also fix up the nfsd README to match the nfs entry.
All tools in OpenBSD base system use swpginuse instead of swpgonly
for reporting swap usage (snmpd, swapctl, top, vmstat), so let
memory collector use that as well for consistency.
* Add overlay to defIgnoredFSTypes
To avoid statfs() errors if node_exporter is running as a non-privileged user.
* Updated defIgnoredFSTypes values in sorted order
* Update vendor github.com/prometheus/procfs/...
* Refactor NFS collector
Use new procfs library to parse NFS client stats.
* Ignore nfs proc file not existing.
* Refactor with reflection to walk the structs.
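Walking a stats struct with reflection lets one loop emit a sample per counter field instead of listing every field by hand. A minimal Go sketch, assuming a hypothetical `v3Stats` struct whose uint64 fields are per-procedure counters (the real procfs NFS types and the collector's metric names are not reproduced here):

```go
package main

import (
	"fmt"
	"reflect"
)

// v3Stats is a hypothetical stand-in for a procfs NFS stats struct;
// each uint64 field is one per-procedure counter.
type v3Stats struct {
	Null    uint64
	GetAttr uint64
	SetAttr uint64
	Lookup  uint64
}

// walkCounters calls emit once for every uint64 field of the struct value s,
// in declaration order, using the Go field name as the label value.
// Non-struct values (including pointers) are ignored.
func walkCounters(s interface{}, emit func(name string, value uint64)) {
	v := reflect.ValueOf(s)
	if v.Kind() != reflect.Struct {
		return
	}
	t := v.Type()
	for i := 0; i < v.NumField(); i++ {
		f := v.Field(i)
		if f.Kind() != reflect.Uint64 {
			continue
		}
		emit(t.Field(i).Name, f.Uint())
	}
}

func main() {
	stats := v3Stats{Null: 1, GetAttr: 42, SetAttr: 3, Lookup: 17}
	walkCounters(stats, func(name string, value uint64) {
		fmt.Printf("procedure=%s value=%d\n", name, value)
	})
}
```

Adding a field to the struct then needs no change in the emitting code.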
The node exporter runs unprivileged, so it cannot statfs any filesystems
under this directory causing log spam. In addition there tends to be
high churn in the filesystems here (as it's basically application
monitoring) which can cause high cardinality and in one case caused
Prometheus's index symbol table to get very large.
Accordingly this should be ignored to reduce log spam and avoid
performance issues. The filesystems themselves can in principle be
monitored via container oriented exporters, and the underlying
filesystems will still be monitored.
* Unify CPU collector conventions
Add a common CPU metric description.
* All collectors use the same `nodeCpuSecondsDesc`.
* All collectors drop the `cpu` prefix for `cpu` label values.
* Fix subsystem string in cpu_freebsd.
* Fix Linux CPU freq label names.
* Improve stat linux metric names.
cpu is no longer used.
* node_cpu -> node_cpu_seconds_total for Linux
* Improve filesystem metric names with units
* Improve units and names of linux disk stats
Remove sector metrics, the bytes metrics cover those already.
* Infiniband counters should end in _total
* Improve timex metric names, convert to more normal units.
See
3c073991eb/kernel/time/ntp.c (L909)
for what stabil means, looks like a moving average of some form.
* Update test fixture
* For meminfo metrics that had "kB" units, add _bytes
* Interrupts counter should have _total
* Move FreeBSD/DragonflyBSD out of meminfo add kvm.
This gives us SwapUsed, and everything under one roof.
* Fix typos per review.
* Update to use newer API.
* Remove premature optimization per PR feedback.
* Implements meminfo collector for OpenBSD
This is a rework of #151.
* Fix CGO import
* Add some useful metrics
* Rename total -> size for normalization
* remove injection hook for textfile metrics, convert them to prometheus format
* add support for summaries
* add support for histograms
* add logic for handling inconsistent labels within a metric family for counter, gauge, untyped
* change logic for parsing the metrics textfile
* fix logic for adding missing labels
* Export time and error metrics for textfiles
* Add tests for new textfile collector, fix found bugs
* refactor Update() to split into smaller functions
* remove parseTextFiles(), fix import issue
* add mtime metric directly to channel, fix handling of mtime during testing
* rename variables related to labels
* refactor: add default case, remove if guard for metrics, remove extra loop and slice
* refactor: remove extra loop iterating over metric families
* test: add test case for different metric type, fix found bug
* test: add test for metrics with inconsistent labels
* test: add test for histogram
* test: add test for histogram with extra dimension
* test: add test for summary
* test: add test for summary with extra dimension
* remove unnecessary creation of protobuf
* nit: remove extra blank line
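Handling inconsistent labels within a metric family amounts to giving every sample the same set of label names, with missing ones filled in as empty strings (in Prometheus an empty label value is equivalent to the label being absent). A minimal Go sketch, assuming each sample's labels are held in a plain map; `fillMissingLabels` is a hypothetical name, not the textfile collector's actual function:

```go
package main

import (
	"fmt"
	"sort"
)

// fillMissingLabels collects the union of label names across all samples in
// one metric family, sets any name a sample lacks to "", and returns the
// sorted union. The maps are modified in place.
func fillMissingLabels(family []map[string]string) []string {
	union := map[string]struct{}{}
	for _, labels := range family {
		for name := range labels {
			union[name] = struct{}{}
		}
	}
	names := make([]string, 0, len(union))
	for name := range union {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, labels := range family {
		for _, name := range names {
			if _, ok := labels[name]; !ok {
				labels[name] = ""
			}
		}
	}
	return names
}

func main() {
	family := []map[string]string{{"foo": "a"}, {"bar": "b"}}
	names := fillMissingLabels(family)
	fmt.Println(names, family)
}
```

With a consistent label-name set, all samples of the family can share a single metric descriptor.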
Linux "guest" metrics for VMs are already accounted for in node_cpu
`user` and `nice` metrics. Separate these into their own metric to
avoid duplication of data.
* cpu: Support processor-less (memory-only) NUMA nodes
Processor-less (memory-only) NUMA nodes exist e.g. in systems that use
Intel Optane drives for RAM expansion using Intel Memory Drive
Technology (IMDT).
IMDT RAM expansion supports two modes:
* "Unify Remote Memory domains": present a processor-less (memory-only)
NUMA domain, which is the default
* "Expand local memory domains": to expand each processor’s memory domain
with a portion of the memory made available by Optane and IMDT
This commit fixes a crash in the first case (when "cpulist" is empty).
Here's an example of such a system:
$ numastat -m|head -n5
Per-node system memory usage (in MBs):
                          Node 0          Node 1          Node 2           Total
                 --------------- --------------- --------------- ---------------
MemTotal               118239.56       130816.00       464384.00       713439.56
$ for i in {0..2}; do echo -n "$i: " ; cat /sys/bus/node/devices/node$i/cpulist ; done
0: 0-7,16-23
1: 8-15,24-31
2:
$ /opt/vsmp/bin/vsmpversion -vvv
Memory Drive Technology: 8.2.1455.74 (Sep 28 2017 13:09:59)
System configuration:
Boards: 3
1 x Proc. + I/O + Memory
2 x NVM devices (Intel SSDPED1K375GAQ)
Processors: 2, Cores: 16, Threads: 32
Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz Stepping 01
Memory (MB): 713472 (of 977450), Cache: 251416, Private: 12562
1 x 249088MB [262036/ 678/12270]
1 x 232192MB [357707/125369/ 146] 82:00.0#1
1 x 232192MB [357707/125369/ 146] 83:00.0#1
* cpu: rename some variables (pkg => node)
* cpu: Use %v not %q in log.Debugf() format strings
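The crash came from an empty cpulist. A minimal Go sketch of parsing the cpulist format shown above (comma-separated CPU ids and inclusive ranges), treating an empty list as a node with no CPUs rather than an error; `parseCPUList` is a hypothetical name, not the collector's actual function:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUList expands a sysfs cpulist such as "0-7,16-23" into CPU ids.
// An empty or whitespace-only list, as reported by processor-less
// (memory-only) NUMA nodes, yields no CPUs and no error.
func parseCPUList(s string) ([]int, error) {
	s = strings.TrimSpace(s)
	if s == "" {
		return nil, nil
	}
	var cpus []int
	for _, part := range strings.Split(s, ",") {
		lo, hi := part, part // a single id is a range of one
		if i := strings.Index(part, "-"); i >= 0 {
			lo, hi = part[:i], part[i+1:]
		}
		start, err := strconv.Atoi(lo)
		if err != nil {
			return nil, fmt.Errorf("invalid cpulist %q: %v", s, err)
		}
		end, err := strconv.Atoi(hi)
		if err != nil {
			return nil, fmt.Errorf("invalid cpulist %q: %v", s, err)
		}
		if end < start {
			return nil, fmt.Errorf("invalid range %q in cpulist", part)
		}
		for c := start; c <= end; c++ {
			cpus = append(cpus, c)
		}
	}
	return cpus, nil
}

func main() {
	// The three cpulist values from the example system above.
	for _, list := range []string{"0-7,16-23", "8-15,24-31", ""} {
		cpus, err := parseCPUList(list)
		fmt.Println(len(cpus), err)
	}
}
```

Callers can then skip per-CPU work for a node whose list is empty instead of indexing into it.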
* Update golang.org/x/sys/unix
This allows simplified string conversion of Utsname members.
* Simplify Utsname string conversion
Use Utsname from golang.org/x/sys/unix, which contains byte array
members instead of int8/uint8 array members. This simplifies the string
conversions of these members.
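With byte array members, each field can be converted by cutting at the first NUL byte; otherwise the Go string would carry trailing zero bytes into label values. A minimal Go sketch, assuming the 65-byte field size that Utsname members have on Linux (other platforms differ); `utsField` is a hypothetical helper:

```go
package main

import (
	"bytes"
	"fmt"
)

// utsField converts a NUL-terminated fixed-size byte array, such as a
// Utsname member on Linux, into a Go string without the trailing NULs.
// If no NUL is present, the whole array is used.
func utsField(b [65]byte) string {
	n := bytes.IndexByte(b[:], 0)
	if n < 0 {
		n = len(b)
	}
	return string(b[:n])
}

func main() {
	var release [65]byte
	copy(release[:], "4.15.0-generic")
	fmt.Printf("%q\n", utsField(release))
}
```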
* Correct buffer_bytes > INT_MAX on BSD/amd64.
The sysctl vfs.bufspace returns either an int or a long, depending on
the value. Large values of vfs.bufspace will result in error messages
like:
couldn't get meminfo: cannot allocate memory
This detects the returned data type and casts appropriately.
* Added explicit length checks per feedback.
* Flatten Value() to make it easier to read.
* Simplify per feedback.
* Fix style.
* Doc updates.
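Detecting the returned type comes down to checking how many bytes the sysctl wrote: 4 for an int, 8 for a long on amd64. A minimal Go sketch of that length check, assuming little-endian byte order (as on amd64) and leaving out how the raw bytes are fetched; `decodeSysctlInt` is a hypothetical name, not the collector's actual code:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// decodeSysctlInt interprets raw sysctl output whose C type may be either
// int (4 bytes) or long (8 bytes on amd64), choosing by the buffer length.
func decodeSysctlInt(raw []byte) (int64, error) {
	switch len(raw) {
	case 4:
		// C int: sign-extend the 32-bit value.
		return int64(int32(binary.LittleEndian.Uint32(raw))), nil
	case 8:
		// C long on amd64.
		return int64(binary.LittleEndian.Uint64(raw)), nil
	default:
		return 0, fmt.Errorf("unexpected sysctl value length %d", len(raw))
	}
}

func main() {
	small := []byte{0x00, 0x00, 0x00, 0x40}             // 1 GiB as a 4-byte int
	large := []byte{0x00, 0x00, 0x00, 0xC0, 0, 0, 0, 0} // 3 GiB, only fits in a long
	for _, raw := range [][]byte{small, large} {
		v, err := decodeSysctlInt(raw)
		fmt.Println(v, err)
	}
}
```

The explicit length check rejects any other size instead of silently misreading it.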
The github.com/beevik/ntp package was recently updated with some
API changes that broke node_exporter. This commit fetches the
latest version of the ntp package and brings node_exporter in
line with the latest API.
* Move NodeCollector into package collector
* Refactor collector enabling
* Update README with new collector enabled flags
* Fix out-of-date inline flag reference syntax
* Use new flags in end-to-end tests
* Add flag to disable all default collectors
* Track if a flag has been set explicitly
* Add --collectors.disable-defaults to README
* Revert disable-defaults flag
* Shorten flags
* Fixup timex collector registration
* Fix end-to-end tests
* Change procfs and sysfs path flags
* Fix review comments
This collector is based on the adjtimex(2) system call. The collector returns
three values: whether time is synchronised, the offset to the remote
reference, and the local clock frequency adjustment.
Values are taken from the kernel's timekeeping data structures to avoid
getting involved in how the synchronisation is implemented. By that I mean
one should not care whether time is updated using ntpd, systemd-timesyncd,
ptpd, and so on. Since every time sync implementation ultimately tells the
kernel the state of the clock, one can simply skip the software in between
and look at the results of the syncing. As a positive side effect, this makes
the collector very quick and conceptually specific: it does not monitor the
availability of the NTP server, the network in between, DNS resolution, or
other unrelated but necessary things.
The minimum set of values to keep an eye on are the following three:
The node_timex_sync_status tells whether the local clock is in sync with a
remote clock. The value is set to zero when synchronisation to a reliable
server is lost, or when the time sync software is misconfigured.
The node_timex_offset_seconds tells how far off the local clock is when
compared to the reference. In case of multiple time references this value
is the outcome of the RFC 5905 adjustment algorithm. Ideally the offset
should be close to zero; how large a value is acceptable depends on the use
case. For example, a typical web server is probably fine if the offset is
about 0.1 seconds or less, but that would not be good enough for a mobile
phone base station operator.
The node_timex_freq tells the amount of adjustment to the local clock tick
frequency. For example, if the offset is one second and growing, the local
clock will need an instruction to tick quicker. The number itself is not
very important, and occasional small adjustments are fine. When the
frequency is unusually unstable, one can assume that timestamps will not be
accurate far into the sub-second range. Obviously, explaining why the local
clock frequency behaves like a passenger on a roller coaster is a different
matter. Explanations can vary from system load to environmental issues such
as a machine being physically too hot.
The rest of the measurements can help when debugging. If you run a clock
server, you probably want to collect and keep track of everything.
Pull-request: https://github.com/prometheus/node_exporter/pull/664
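Per adjtimex(2), the raw timex fields need unit conversion: `offset` is in microseconds unless the STA_NANO status bit is set (then nanoseconds), and `freq` is in ppm with a 16-bit binary fraction. A minimal Go sketch of those conversions, assuming the raw values have already been read via adjtimex(2) (that call is not shown, and the helper names are hypothetical):

```go
package main

import "fmt"

const (
	staNano  = 0x2000  // STA_NANO status bit: offset is in nanoseconds
	ppmScale = 1 << 16 // freq carries a 16-bit binary fraction
)

// offsetSeconds converts the timex offset field to seconds, honouring the
// STA_NANO bit in status.
func offsetSeconds(offset int64, status int32) float64 {
	if status&staNano != 0 {
		return float64(offset) / 1e9
	}
	return float64(offset) / 1e6
}

// freqRatio converts the scaled-ppm freq field to a plain ratio,
// so that 1 ppm becomes 1e-6.
func freqRatio(freq int64) float64 {
	return float64(freq) / ppmScale / 1e6
}

func main() {
	fmt.Println(offsetSeconds(250000, 0))          // 250000 us
	fmt.Println(offsetSeconds(250000000, staNano)) // 250000000 ns
	fmt.Println(freqRatio(10 * ppmScale))          // 10 ppm
}
```

Converting to seconds and a plain ratio keeps the exported values in base units regardless of which mode the kernel is in.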
* Add metrics from SNTPv4 packet to ntp collector & add ntpd sanity check
1. Checking the local clock against a remote NTP daemon is a bad idea; the
local ntpd acting as a client should do it better and avoid excessive load
on the remote NTP server, so the collector is refactored to query the local
NTP server.
2. Checking the local clock against a remote one does not check the local
ntpd itself. The local ntpd may be down or out of sync due to network
issues, but the clock will be OK.
3. Checking an NTP server using the sanity of its response is tricky and
depends on the ntpd implementation; that's why a common `node_ntp_sanity`
metric is exported.
* `govendor add golang.org/x/net/ipv4`, it is dependency of github.com/beevik/ntp
* Update github.com/beevik/ntp to include boring SNTP fix
* Use variable name from RFC5905
* ntp: move code to make export of raw metrics more explicit
* Move NTP math to `github.com/beevik/ntp`
* Make `golint` happy
* Add some brief docs explaining `ntp` #655 and `timex` #664 modules
* ntp: drop XXX comment that got its decision
* ntp: add `_seconds` suffix to relevant metrics
* Better `node_ntp_leap` comment
* s/node_ntp_reftime/node_ntp_reference_timestamp_seconds/ as requested by @discordianfish
* Extract subsystem name to const as suggested by @SuperQ
* cpu: Metric 'package_throttles_total' is per package.
'package_throttles_total' is per package, not per cpu. This also reduces
the total number of cpu time series a lot (especially for multi-core CPUs).
* cpu: Better handling of a cpulist edge-case.
* cpu: Extract the package number from the directory name.
Do not rely on the range index.
* cpu: Add package_throttle_count for node0 cpu1
This file must be ignored by the cpu collector.
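Relying on the range index breaks because filepath.Glob returns matches in lexical order, so node10 sorts before node2, and node numbers need not be contiguous. A minimal Go sketch of taking the number from the directory name instead, assuming sysfs node directories named like /sys/bus/node/devices/node1; `nodeNumber` is a hypothetical helper:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strconv"
	"strings"
)

// nodeNumber extracts the numeric suffix from a sysfs node directory such as
// /sys/bus/node/devices/node1, instead of trusting the position at which the
// directory appeared in a glob result.
func nodeNumber(dir string) (int, error) {
	base := filepath.Base(dir)
	if !strings.HasPrefix(base, "node") {
		return 0, fmt.Errorf("unexpected directory name %q", base)
	}
	return strconv.Atoi(strings.TrimPrefix(base, "node"))
}

func main() {
	// Lexically sorted, as filepath.Glob would return them.
	dirs := []string{"node0", "node1", "node10", "node2"}
	for i, d := range dirs {
		n, err := nodeNumber(d)
		fmt.Println(i, d, n, err)
	}
}
```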
This avoids issues with integer overflows on 32-bit architectures. The
Prometheus data format is float64, so regardless of the architecture we
should handle large numbers.
Fixes #629.
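On 32-bit platforms Go's int is 32 bits wide, so strconv.Atoi fails for anything above 2147483647. A minimal Go sketch of parsing with an explicit 64-bit size and converting to float64 (the Prometheus sample type), assuming the source values are non-negative decimal integers; `parseCounter` is a hypothetical name:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCounter parses a kernel counter string into float64. ParseUint with
// bitSize 64 behaves the same on 32-bit and 64-bit architectures, unlike
// strconv.Atoi, whose range follows the platform int size.
func parseCounter(s string) (float64, error) {
	v, err := strconv.ParseUint(strings.TrimSpace(s), 10, 64)
	if err != nil {
		return 0, err
	}
	return float64(v), nil
}

func main() {
	v, err := parseCounter("5000000000\n") // larger than a 32-bit int can hold
	fmt.Println(v, err)
}
```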
* Add bcache collector for Linux
This collector gathers metrics related to the Linux block cache
(bcache) from sysfs.
* Removed commented out code
* Use project comment style
* Add _sectors to metric name to indicate unit
* Really use project comment style
* Rename bcache.go to bcache_linux.go
* Keep collector namespace clean
Rename:
- metric -> bcacheMetric
- periodStatsToMetrics -> bcachePeriodStatsToMetric
* Shorten slice initialization
* Change label names to backing_device, cache_device
* Remove five minute metrics (keep only total)
* Include units in additional metric names
* Enable bcache collector by default
* Provide metrics in seconds, not nanoseconds
* remove metrics with label "all"
* Add fixtures, update end-to-end for bcache collector
* Move fixtures/sys into tar.gz
This changeset moves the collector/fixtures/sys directory into
collector/fixtures/sys.tar.gz and tweaks the Makefile to unpack the
tarball before tests are run.
The reason for this change is that Windows does not allow colons in a
path (colons are present in some of the bcache fixture files), nor can
it (out of the box) deal with pathnames longer than 260 characters
(which we would be increasingly likely to hit if we tried to replace
colons with longer codes that are guaranteed not to turn up in regular
file names).
* Add ttar: plain text archive, replacement for tar
This changeset adds ttar, a plain text replacement for tar, and uses it
for the sysfs fixture archive. The syntax is loosely based on tar(1).
Using a plain text archive makes it possible to review changes without
downloading and extracting the archive. Also, when working on the repo,
git diff and git log become useful again, allowing a committer to verify
and track changes over time.
The code is written in bash, because bash is available out of the box on
all major flavors of Linux and on macOS. The feature set used is
restricted to bash version 3.2 because that is what Apple is still
shipping.
The program also works on Windows if bash is installed. Obviously, it
does not solve the Windows limitations (path length limited to 260
characters, no symbolic links) that prompted the move to an archive
format in the first place.
* Add diskstats collector for Darwin
* Update year in the header
* Update README.md
* Add github.com/lufia/iostat to vendored packages
* Change stats to follow naming guidelines
* Add an entry for github.com/lufia/iostat to vendor.json
* Remove /proc/diskstats from description
* Add qdisc collector for Linux
This collector gathers basic queueing discipline metrics via netlink,
similarly to what `tc -s qdisc show` does.
* qdisc collector: nl-specific code moved, names fixed
- netlink-specific parts moved to github.com/ema/qdisc
- avoid using shortened names
- counters renamed into XXX_total
* Get rid of parseMessage error checking leftover
* Add github.com/ema/qdisc to vendored packages
* Update help texts and comments
* Add qdisc collector to README file
* qdisc collector end-to-end testing
* Update qdisc dependency to latest version
Update github.com/ema/qdisc dependency to revision 2c7e72d, which
includes unit testing.
* qdisc collector: rename "iface" label into "device"
According to Mellanox, it is standard practice that the port_xmit_data and port_rcv_data
files are split into 4 lanes. To get the actual transmit and receive values for each
port, the metric needs to be multiplied by 4.
Signed-Off-By: Robert Clark <robert.d.clark@hpe.com>
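The correction described above is a fixed factor of 4 applied to the raw counter. A minimal Go sketch, with hypothetical names; converting to float64 before multiplying matches the Prometheus sample type and avoids wrapping a uint64 for very large raw values:

```go
package main

import "fmt"

// infinibandLanes is the factor described above: port_xmit_data and
// port_rcv_data are split into 4 lanes.
const infinibandLanes = 4

// portDataTotal converts a raw port_xmit_data or port_rcv_data reading into
// the actual transmit or receive value for the port.
func portDataTotal(raw uint64) float64 {
	return float64(raw) * infinibandLanes
}

func main() {
	fmt.Println(portDataTotal(1000))
}
```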
* silently ignore nonexistent bonding_masters file
Add an empty fixtures dir without a bonding_masters file to test.
* Moved the check to the Update() method
Dropped the empty test dir.
Since Go 1.8, 32-bit MIPS (big and little endian) is supported, assuming the
target runs Linux and the kernel either emulates an FPU or can access
the CPU one.
This allows the node_collector to build for mips and mipsle, opening up
the possibility of running it on things like home routers;
(DD-|Open|ASUS-)Wrt firmware usually has the necessary bits in place.