Avoid metrics with inconsistent help texts. The earlier behaviour is
preserved in that the first encountered instance is still used to
generate metrics, while subsequent inconsistent ones are ignored. This
commit also includes a few peripheral changes.
```
# HELP node_scrape_collector_duration_seconds node_exporter: Duration of a collector scrape.
# TYPE node_scrape_collector_duration_seconds gauge
node_scrape_collector_duration_seconds{collector="textfile"} 0.0004005
# HELP node_scrape_collector_success node_exporter: Whether a collector succeeded.
# TYPE node_scrape_collector_success gauge
node_scrape_collector_success{collector="textfile"} 1
# HELP node_textfile_mtime_seconds Unixtime mtime of textfiles successfully read.
# TYPE node_textfile_mtime_seconds gauge
node_textfile_mtime_seconds{file="/Users/rexagod/repositories/misc/node_exporter/ne-bar.prom"} 1.710812009e+09
node_textfile_mtime_seconds{file="/Users/rexagod/repositories/misc/node_exporter/ne-foo.prom"} 1.710811982e+09
# HELP node_textfile_scrape_error 1 if there was an error opening or reading a file, 0 otherwise
# TYPE node_textfile_scrape_error gauge
node_textfile_scrape_error 1
# HELP promhttp_metric_handler_errors_total Total number of internal errors encountered by the promhttp metric handler.
# TYPE promhttp_metric_handler_errors_total counter
promhttp_metric_handler_errors_total{cause="encoding"} 0
promhttp_metric_handler_errors_total{cause="gathering"} 0
# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.
# TYPE promhttp_metric_handler_requests_in_flight gauge
promhttp_metric_handler_requests_in_flight 1
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 0
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0
# HELP tau_infrastructure_performing_maintenance_task At what timestamp a given task started or stopped, the last time it was run.
# TYPE tau_infrastructure_performing_maintenance_task gauge
tau_infrastructure_performing_maintenance_task{main_task="nightly",start_or_stop="start",sub_task="main"} 1.64728080198446e+09
```
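The first-wins rule described above can be sketched roughly as follows. This is only an illustration, not the textfile collector's actual code: the `sample` type and the `dedupeHelp` function are hypothetical names introduced here.

```go
package main

import "fmt"

// sample is a hypothetical stand-in for one parsed metric family.
type sample struct {
	Name string
	Help string
	File string
}

// dedupeHelp keeps every sample whose help text agrees with the first one
// seen for that metric name, and drops later samples that disagree.
func dedupeHelp(in []sample) (kept, dropped []sample) {
	firstHelp := make(map[string]string)
	for _, s := range in {
		help, seen := firstHelp[s.Name]
		if !seen {
			// First encounter: this help text becomes the reference.
			firstHelp[s.Name] = s.Help
			kept = append(kept, s)
			continue
		}
		if help != s.Help {
			dropped = append(dropped, s)
			continue
		}
		kept = append(kept, s)
	}
	return kept, dropped
}

func main() {
	kept, dropped := dedupeHelp([]sample{
		{Name: "foo_total", Help: "Foo.", File: "ne-foo.prom"},
		{Name: "foo_total", Help: "Different foo.", File: "ne-bar.prom"},
	})
	fmt.Println("kept:", len(kept), "dropped:", len(dropped))
}
```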
Fixes: #2317
Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>
Apply the same metric name sanitization to the keys as to the metric
names. This avoids conflicting help strings in the metric registry.
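A minimal sketch of the idea, assuming a regexp-based sanitizer. The `sanitizeName` helper and its pattern are illustrative and not necessarily what the collector uses:

```go
package main

import (
	"fmt"
	"regexp"
)

// invalidChars matches any run of characters not allowed in a metric name.
var invalidChars = regexp.MustCompile(`[^a-zA-Z0-9_]+`)

// sanitizeName is a hypothetical helper that replaces each invalid run
// with a single underscore.
func sanitizeName(s string) string {
	return invalidChars.ReplaceAllString(s, "_")
}

func main() {
	// Using the sanitized string as the map key means raw names that
	// collapse to the same metric name also share one help string.
	help := make(map[string]string)
	for _, raw := range []string{"rpc.calls", "rpc-calls"} {
		key := sanitizeName(raw)
		if _, ok := help[key]; !ok {
			help[key] = "help for " + key
		}
	}
	fmt.Println(len(help), help)
}
```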
Fixes: https://github.com/prometheus/node_exporter/issues/2893
Signed-off-by: Ben Kochie <superq@gmail.com>
Fix golangci-lint "ineffectual assignment" by correctly capturing any
errors within the hwmon gathering loop.
Signed-off-by: Ben Kochie <superq@gmail.com>
While the CPU vulnerabilities collector was added in https://github.com/prometheus/node_exporter/pull/2721 , it does not currently include information about the mitigation strategy used for a given vulnerability.
This information can be quite valuable, as different mitigation strategies often come with different performance impacts.
This commit adds a third label to the cpu_vulnerabilities_info metric to include the "mitigation" used for a given vulnerability. If a vulnerability does not affect a node, or the node is still vulnerable, the mitigation is expected to be empty.
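As a rough illustration of how the label values could be derived, assuming the sysfs vulnerability files contain lines such as "Mitigation: PTE Inversion", "Vulnerable" or "Not affected". The `parseVulnerability` helper and the state strings it returns are hypothetical, not the collector's actual code:

```go
package main

import (
	"fmt"
	"strings"
)

// parseVulnerability splits one sysfs vulnerability line into a state and
// a mitigation. The mitigation is empty unless the state is "mitigation".
func parseVulnerability(raw string) (state, mitigation string) {
	raw = strings.TrimSpace(raw)
	switch {
	case strings.HasPrefix(raw, "Mitigation:"):
		return "mitigation", strings.TrimSpace(strings.TrimPrefix(raw, "Mitigation:"))
	case strings.HasPrefix(raw, "Vulnerable"):
		return "vulnerable", ""
	case raw == "Not affected":
		return "not affected", ""
	default:
		return "unknown", ""
	}
}

func main() {
	for _, line := range []string{"Mitigation: PTE Inversion", "Vulnerable", "Not affected"} {
		state, mitigation := parseVulnerability(line)
		fmt.Printf("state=%q mitigation=%q\n", state, mitigation)
	}
}
```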
Signed-off-by: João Lima <jlima@cloudflare.com>
Add a count of TCP packets received out of order. This can be an
indication that there is packet loss on the path packets travel towards
this server. In that case, the sender will retransmit (and we can
already monitor Tcp_RetransSegs there), but we have no way to
monitor the packet loss on the receiver side. When a packet is received
and the receiver detects that a previous one is missing, it increases the
TCPOFOQueue counter and replies with a selective ACK to the sender, both
possible indications of packet loss. Packet loss can be confirmed by
taking packet captures, ignoring Wireshark analysis, and
carefully looking at the data being retransmitted based on the TCP seq.
Just like RetransSegs, TCPOFOQueue should be interesting for any
deployment as a means to detect packet loss, so this change suggests
adding it to the default list.
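For reference, a small sketch of how such a counter can be read, assuming the usual /proc/net/netstat layout of a header line of names followed by a line of values for each protocol. The `lookupNetstat` helper is hypothetical and is not how the netstat collector is implemented:

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// lookupNetstat finds one counter in netstat-style text, where each
// protocol has a line of field names followed by a line of values.
func lookupNetstat(text, proto, field string) (uint64, error) {
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		names := strings.Fields(sc.Text())
		if !sc.Scan() {
			break
		}
		values := strings.Fields(sc.Text())
		// Skip other protocols and malformed name/value pairs.
		if len(names) == 0 || names[0] != proto+":" || len(names) != len(values) {
			continue
		}
		for i := 1; i < len(names); i++ {
			if names[i] == field {
				return strconv.ParseUint(values[i], 10, 64)
			}
		}
	}
	return 0, fmt.Errorf("%s_%s not found", proto, field)
}

func main() {
	// Made-up sample data, not captured from a real host.
	text := "TcpExt: SyncookiesSent TCPOFOQueue\nTcpExt: 0 42\n"
	v, err := lookupNetstat(text, "TcpExt", "TCPOFOQueue")
	fmt.Println(v, err)
}
```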
Signed-off-by: François Rigault <frigo@amadeus.com>
Co-authored-by: François Rigault <frigo@amadeus.com>
This attribute was introduced in v6.6-rc1.
The relevant changes in procfs were merged here:
https://github.com/prometheus/procfs/pull/574
and are part of procfs v0.11.2
I have also figured out that the stat should be part of the v4 ops
counters struct, but that will need changes to both procfs and this
code. Since people are already using 6.6-rc1, I think it's better to get
the code out there --- even if they don't care about wdeleg_getattr,
currently they get _no_ nfsd stats with 6.6-rc1.
I will make two follow-up PRs to clean this up in the next releases of
procfs and node-exporter.
Signed-off-by: Tobias Klausmann <klausman@schwarzvogel.de>
* Rename parsePoolObjsetFile to parseLinuxPoolObjsetFile to better reflect
its scope
* Create a new parseFreeBSDPoolObjsetStats function, to generate a list
of per pool metrics to be queried via sysctl
---------
Signed-off-by: Conall O'Brien <conall@conall.net>
* Optionally fetch ARP stats via rtnetlink instead of procfs
Implement collection of ARP stats via rtnetlink to work around
shortcomings in the output of /proc/net/arp, which truncates InfiniBand
link-layer addresses.
Fixes: #2776
---------
Signed-off-by: Daniel Swarbrick <daniel.swarbrick@gmail.com>
Co-authored-by: Ben Kochie <superq@gmail.com>
Despite being quite hard to provoke (< 10% in my testing), the btrfs
collector would occasionally leave stale FDs relating to btrfs
mountpoints, making the filesystems unable to be unmounted.
Fixes: #2772.
Signed-off-by: Daniel Swarbrick <daniel.swarbrick@gmail.com>
Revert changes to node_cpu_info and add new node_cpu_frequency_hertz
metric for measuring CPU frequency from /proc/cpuinfo
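A minimal sketch of the conversion involved, assuming /proc/cpuinfo lines of the form "cpu MHz : 2400.000". The `cpuHertz` helper is a hypothetical name, not the collector's actual function:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// cpuHertz extracts the "cpu MHz" value from one /proc/cpuinfo line and
// converts it to hertz. It reports false for any other line.
func cpuHertz(line string) (float64, bool) {
	key, value, found := strings.Cut(line, ":")
	if !found || strings.TrimSpace(key) != "cpu MHz" {
		return 0, false
	}
	mhz, err := strconv.ParseFloat(strings.TrimSpace(value), 64)
	if err != nil {
		return 0, false
	}
	return mhz * 1e6, true
}

func main() {
	for _, line := range []string{"model name\t: Example CPU", "cpu MHz\t\t: 2400.000"} {
		if hz, ok := cpuHertz(line); ok {
			fmt.Println("frequency in Hz:", hz)
		}
	}
}
```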
Signed-off-by: John Kordich <jkordich@gmail.com>
For CPUs which don't have an available (or insertable) cpufreq driver,
the /proc/cpuinfo file can sometimes have accurate CPU core frequency
measurements. This change replaces the constant value of "1" for the
"node_cpu_info" metric with the parsed CPU MHz value from
/proc/cpuinfo for each core.
Signed-off-by: John Kordich <jkordich@gmail.com>
Ensure that unwanted tests are correctly excluded when various build
tags are specified, i.e. when the code that they test would be excluded
from compilation.
Signed-off-by: Daniel Swarbrick <daniel.swarbrick@gmail.com>
Drop redundant GOOS build tags at start of file if the constraint is
already specified by the filename, e.g. foo_GOOS.go or
foo_GOOS_GOARCH.go, avoiding potential confusion in future.
cf. https://pkg.go.dev/cmd/go#hdr-Build_constraints
Signed-off-by: Daniel Swarbrick <daniel.swarbrick@gmail.com>
On some platforms, `msg.Attributes.Stats64` is `nil` because the kernel doesn't
expose 64-bit stats. In that case, return `msg.Attributes.Stats` instead, which
are the 32-bit equivalent.
Note that `RXOtherhostDropped` isn't available in that case, so we hardcode it
to zero.
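The fallback can be sketched like this. The struct types below are simplified, hypothetical stand-ins with only a few fields and are not the rtnetlink package's real definitions:

```go
package main

import "fmt"

// linkStats and linkStats64 are simplified stand-ins for the 32-bit and
// 64-bit interface statistics structs.
type linkStats struct {
	RXPackets, TXPackets uint32
}

type linkStats64 struct {
	RXPackets, TXPackets, RXOtherhostDropped uint64
}

type attributes struct {
	Stats   *linkStats
	Stats64 *linkStats64
}

// counters prefers the 64-bit stats and widens the 32-bit ones when the
// kernel did not provide Stats64.
func counters(a attributes) (linkStats64, bool) {
	if a.Stats64 != nil {
		return *a.Stats64, true
	}
	if a.Stats != nil {
		return linkStats64{
			RXPackets: uint64(a.Stats.RXPackets),
			TXPackets: uint64(a.Stats.TXPackets),
			// Not present in the 32-bit struct, so fixed at zero.
			RXOtherhostDropped: 0,
		}, true
	}
	return linkStats64{}, false
}

func main() {
	s, ok := counters(attributes{Stats: &linkStats{RXPackets: 10, TXPackets: 20}})
	fmt.Println(s, ok)
}
```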
Fixes #2756.
Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
Use the correct include value to the device filter function.
* Add new bogus hwmon fixture.
* Update end-to-end test to use hwmon chip include flag.
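For context, an include/exclude filter of this kind can be sketched as below. The type and function names are illustrative and the details may differ from the collector's own device filter:

```go
package main

import (
	"fmt"
	"regexp"
)

// deviceFilter holds optional exclude and include patterns.
type deviceFilter struct {
	exclude *regexp.Regexp
	include *regexp.Regexp
}

// newDeviceFilter compiles the patterns; an empty pattern disables that
// side of the filter. Swapping the two arguments silently inverts the
// filter, which is the kind of mistake this sketch is meant to highlight.
func newDeviceFilter(excludePattern, includePattern string) deviceFilter {
	var f deviceFilter
	if excludePattern != "" {
		f.exclude = regexp.MustCompile(excludePattern)
	}
	if includePattern != "" {
		f.include = regexp.MustCompile(includePattern)
	}
	return f
}

// ignored reports whether a name is excluded or fails the include pattern.
func (f deviceFilter) ignored(name string) bool {
	return (f.exclude != nil && f.exclude.MatchString(name)) ||
		(f.include != nil && !f.include.MatchString(name))
}

func main() {
	f := newDeviceFilter("", "^coretemp")
	for _, chip := range []string{"coretemp_0", "bogus_chip"} {
		fmt.Println(chip, "ignored:", f.ignored(chip))
	}
}
```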
Signed-off-by: Ben Kochie <superq@gmail.com>
prefix.
Leave an annotation about using regexps instead of device_filter.go, so
@SuperQ doesn't need to remember everything.
Signed-off-by: Conall O'Brien <conall@conall.net>
* Add include and exclude chip name flags to the hwmon collector, following the example in the systemd collector
---------
Signed-off-by: Conall O'Brien <conall@conall.net>
Co-authored-by: Ben Kochie <superq@gmail.com>
This change adds the ability to process multiple stat calls in parallel.
Processing is rate-limited based on the new flag
`collector.filesystem.stat-workers` (default 4).
Caveat: filesystem stats information is no longer in the same order as
returned by `/proc/1/mounts`. This should not be an issue.
Caveat: This change currently uses unbuffered channels to prove
correctness without reliance on buffers. Buffered channels will yield
superior performance.
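The pattern can be sketched as a fixed pool of workers fed through unbuffered channels. This is a simplified illustration with hypothetical names (`statAll`, the `stat` callback), not the collector's actual code:

```go
package main

import (
	"fmt"
	"sync"
)

// statAll calls stat for every mount point using a fixed number of
// workers. Results arrive in completion order, not input order.
func statAll(mounts []string, workers int, stat func(string) uint64) map[string]uint64 {
	if workers < 1 {
		workers = 1
	}
	type result struct {
		mount string
		size  uint64
	}
	jobs := make(chan string)    // unbuffered: a send waits for a free worker
	results := make(chan result) // unbuffered: a worker waits for the reader

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for m := range jobs {
				results <- result{mount: m, size: stat(m)}
			}
		}()
	}
	// Feed the jobs, then signal the workers that no more are coming.
	go func() {
		for _, m := range mounts {
			jobs <- m
		}
		close(jobs)
	}()
	// Close results once every worker has finished.
	go func() {
		wg.Wait()
		close(results)
	}()

	out := make(map[string]uint64, len(mounts))
	for r := range results {
		out[r.mount] = r.size
	}
	return out
}

func main() {
	fake := func(m string) uint64 { return uint64(len(m)) }
	fmt.Println(statAll([]string{"/", "/home", "/var"}, 4, fake))
}
```

Giving the channels a capacity would let workers proceed without waiting for the reader, which is the performance trade-off noted in the caveat above.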
Signed-off-by: Erica Mays <erica@emays.dev>
Read missing dev_id, name_assign_type, and addr_assign_type
from sysfs, since they only take a device-specific lock and
not the whole RTNL lock. This means reading them is much less
impactful on other system processes than many of the other
attributes in sysfs that do take the RTNL lock.
Signed-off-by: Dan Williams <dcbw@redhat.com>
On most hard drives, `ID_SERIAL_SHORT` and `SCSI_IDENT_SERIAL` are identical,
but on some SAS drives they do differ. In that case, `SCSI_IDENT_SERIAL`
corresponds to the serial number printed on the drive label, and to the value
returned by `smartctl -i`.
So use that value by default for the `serial` label on the `node_disk_info`
metric, and fallback to `ID_SERIAL_SHORT` only if it's undefined.
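In other words, the selection looks roughly like this. The `diskSerial` helper is a hypothetical name, and treating an empty value as undefined is an assumption of this sketch:

```go
package main

import "fmt"

// diskSerial picks the serial number from a map of udev properties,
// preferring SCSI_IDENT_SERIAL and falling back to ID_SERIAL_SHORT.
func diskSerial(props map[string]string) string {
	if s := props["SCSI_IDENT_SERIAL"]; s != "" {
		return s
	}
	return props["ID_SERIAL_SHORT"]
}

func main() {
	// Made-up property values for illustration.
	sas := map[string]string{"ID_SERIAL_SHORT": "5000c500a1b2c3d4", "SCSI_IDENT_SERIAL": "ZL0ABCDE"}
	sata := map[string]string{"ID_SERIAL_SHORT": "WD-WX12345"}
	fmt.Println(diskSerial(sas), diskSerial(sata))
}
```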
Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
Mark the `supervisord` collector as deprecated. This process
supervisor, like `runit`, is out of scope for the node_exporter.
Signed-off-by: Ben Kochie <superq@gmail.com>
* bcache: remove cache_readaheads_totals metrics #2103
Signed-off-by: Saleh Sal <0xack13@gmail.com>
* Append bcacheReadaheadMetrics when CacheReadaheads value exists
Signed-off-by: Saleh Sal <0xack13@gmail.com>
* Update test cases for cachereadahead greater than zero
Signed-off-by: Saleh Sal <0xack13@gmail.com>
---------
Signed-off-by: Saleh Sal <0xack13@gmail.com>
Use the filesystem collector for all OpenBSD archs; there is no reason to
only use it on amd64 systems.
Signed-off-by: Claudio Jeker <claudio@openbsd.org>
* Bump exporter-toolkit to the latest release.
* Use new toolkit landing page function.
* Update kingpin flags.
Signed-off-by: Ben Kochie <superq@gmail.com>
The ntp collector has always been a source of confusion and problems.
The data it produces is more of a blackbox probe against an NTP server.
The time sync / offset data produced is not what users expect.
Mark this collector as deprecated, to be removed in v2.0.0.
Signed-off-by: Ben Kochie <superq@gmail.com>
Move metric descriptions to package vars to avoid allocating them every
time `NewCPUFreqCollector()` is called.
Signed-off-by: Ben Kochie <superq@gmail.com>
* Refactor netclass_rtnl collector
Merge the netclass_rtnl collector into the netclass collector.
* Disabled by default
* Followup to #2492
Signed-off-by: Ben Kochie <superq@gmail.com>
* update rtnetlink package to v1.2.3
* add RTNL version of netclass collector that has all the metrics that the netdev collector provides, too.
Signed-off-by: Haoyu Sun <hasun@redhat.com>
Some systems have broken netlink messages due to patched kernels. Since
these messages can not be parsed, add a flag to fall back to parsing
from `/proc/net/dev`.
Fixes: https://github.com/prometheus/node_exporter/issues/2502
Signed-off-by: Ben Kochie <superq@gmail.com>
Signed-off-by: Ben Kochie <superq@gmail.com>
Note however that the InetDiagMsg struct contains an InetDiagSockID
member, which itself contains some members which are explicitly
specified as big-endian in Linux kernel source:
    struct inet_diag_sockid {
            __be16  idiag_sport;
            __be16  idiag_dport;
            __be32  idiag_src[4];
            __be32  idiag_dst[4];
            __u32   idiag_if;
            __u32   idiag_cookie[2];
    };
node_exporter currently does not use these members for anything, so this
is acceptable (for now).
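Should these members ever be needed, they would have to be decoded as big-endian regardless of the host byte order. A minimal sketch using the standard library, where `bePort` is a hypothetical helper:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// bePort decodes a __be16 port field from its two raw bytes.
func bePort(raw []byte) uint16 {
	return binary.BigEndian.Uint16(raw)
}

func main() {
	// Port 443 as stored in a __be16 field: most significant byte first.
	raw := []byte{0x01, 0xbb}
	fmt.Println("big-endian decode:   ", bePort(raw))
	// Decoding the same bytes as little-endian gives a different value.
	fmt.Println("little-endian decode:", binary.LittleEndian.Uint16(raw))
}
```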
Signed-off-by: Daniel Swarbrick <daniel.swarbrick@gmail.com>
We don't need to fully sanitize the hwmon label values to metric/label
name strings.
* Just make sure they're valid UTF-8.
* Always include the label metric to avoid group_left failures.
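A minimal sketch of the UTF-8 check using the standard library. The `cleanLabelValue` name and the choice of U+FFFD as the replacement are assumptions made for illustration:

```go
package main

import (
	"fmt"
	"strings"
)

// cleanLabelValue leaves valid text untouched, including spaces and
// punctuation, and replaces each run of invalid bytes with U+FFFD.
func cleanLabelValue(s string) string {
	return strings.ToValidUTF8(s, "\uFFFD")
}

func main() {
	fmt.Printf("%q\n", cleanLabelValue("CPU Fan #1"))
	fmt.Printf("%q\n", cleanLabelValue("temp\xff sensor"))
}
```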
Signed-off-by: Ben Kochie <superq@gmail.com>
Signed-off-by: Ben Kochie <superq@gmail.com>
Correctly handle the new `collector.diskstats.device-exclude` flag to
avoid errors when using the old `collector.diskstats.ignored-devices`
flag.
Fixes: https://github.com/prometheus/node_exporter/issues/2486
Signed-off-by: Ben Kochie <superq@gmail.com>
* [CHANGE] Merge metrics descriptions in textfile collector #2475
* [FEATURE] [node-mixin] Add darwin dashboard to mixin #2351
* [FEATURE] Add "isolated" metric on cpu collector on linux #2251
* [FEATURE] Add cgroup summary collector #2408
* [FEATURE] Add selinux collector #2205
* [FEATURE] Add slab info collector #2376
* [FEATURE] Add sysctl collector #2425
* [FEATURE] Also track the CPU Spin time for OpenBSD systems #1971
* [FEATURE] Add support for MacOS version #2471
* [ENHANCEMENT] [node-mixin] Add missing selectors #2426
* [ENHANCEMENT] [node-mixin] Change current datasource to grafana's default #2281
* [ENHANCEMENT] [node-mixin] Change disk graph to disk table #2364
* [ENHANCEMENT] [node-mixin] Change io time units to %util #2375
* [ENHANCEMENT] Add user_wired_bytes and laundry_bytes on *bsd #2266
* [ENHANCEMENT] Add additional vm_stat memory metrics for darwin #2240
* [ENHANCEMENT] Add device filter flags to arp collector #2254
* [ENHANCEMENT] Add diskstats include and exclude device flags #2417
* [ENHANCEMENT] Add node_softirqs_total metric #2221
* [ENHANCEMENT] Add rapl zone name label option #2401
* [ENHANCEMENT] Add slabinfo collector #1799
* [ENHANCEMENT] Allow user to select port on NTP server to query #2270
* [ENHANCEMENT] collector/diskstats: Add labels and metrics from udev #2404
* [ENHANCEMENT] Enable builds against older macOS SDK #2327
* [ENHANCEMENT] qdisk-linux: Add exclude and include flags for interface name #2432
* [ENHANCEMENT] systemd: Expose systemd minor version #2282
* [ENHANCEMENT] Use netlink for tcpstat collector #2322
* [ENHANCEMENT] Use netlink to get netdev stats #2074
* [ENHANCEMENT] Add additional perf counters for stalled frontend/backend cycles #2191
* [ENHANCEMENT] Add btrfs device error stats #2193
* [BUGFIX] [node-mixin] Fix fsSpaceAvailableCriticalThreshold and fsSpaceAvailableWarning #2352
* [BUGFIX] Fix concurrency issue in ethtool collector #2289
* [BUGFIX] Fix concurrency issue in netdev collector #2267
* [BUGFIX] Fix diskstat reads and write metrics for disks with different sector sizes #2311
* [BUGFIX] Fix iostat on macos broken by deprecation warning #2292
* [BUGFIX] Fix NodeFileDescriptorLimit alerts #2340
* [BUGFIX] Sanitize rapl zone names #2299
* [BUGFIX] Add file descriptor close safely in test #2447
* [BUGFIX] Fix race condition in os_release.go #2454
* [BUGFIX] Skip ZFS IO metrics if their paths are missing #2451
Signed-off-by: Ben Kochie <superq@gmail.com>
Signed-off-by: Ben Kochie <superq@gmail.com>
* Improve metrics filesystem scanning logic.
* Make ioctl syscalls to load the device error stats.
* Add filesystem mountpoint labels to existing metrics for ease of use.
Signed-off-by: Marcus Cobden <leth@users.noreply.github.com>
The textfile collector will now provide a unified metric description
(that will look like "Metric read from file/a.prom, file/b.prom")
for metrics collected across several text files that don't already
have a description.
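The unified description can be sketched as follows. The `mergedHelp` helper is hypothetical, and sorting the file names for a stable result is an assumption of this sketch:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// mergedHelp builds one description for a metric found in several files.
func mergedHelp(files []string) string {
	sorted := append([]string(nil), files...) // copy so the input is not reordered
	sort.Strings(sorted)
	return "Metric read from " + strings.Join(sorted, ", ")
}

func main() {
	fmt.Println(mergedHelp([]string{"file/b.prom", "file/a.prom"}))
}
```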
Also change the error handling in the textfile collector tests to
ContinueOnError to better mirror the real-life use-case.
Signed-off-by: Guillaume Espanel <guillaume.espanel.ext@ovhcloud.com>
Signed-off-by: Guillaume Espanel <guillaume.espanel.ext@ovhcloud.com>
* Allow user to select port on NTP server to query
Some people (me!) run NTP servers on non-privileged ports. The `github.com/beevik/ntp` package allows overriding the port, so this change just adds a flag `collector.ntp.server-port` (defaults to 123) and then passes that value through to the query via the `QueryOptions`.
Signed-off-by: Andrew Rowson <github@growse.com>