In order to reduce cardinality of the interrupts collector add
filtering options
* Add include/exclude regexp filter flags.
* Add boolean flag to include zero values, enabled by default.
Signed-off-by: Ben Kochie <superq@gmail.com>
* collector/zfs: Prevent `procfs` integer underflow
Prevent integer underflow when parsing the `procfs` file as it used a
`ParseUint` to parse signed values.
Fixes: #2766
---------
Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>
* ref!: convert linux meminfo implementation to use procfs lib
Part of #2957
Prometheus' procfs lib supports collecting memory info and we're using a
new enough version of the lib that has it available, so this converts
the meminfo collector for Linux to use data from procfs lib instead. The
bits I've touched for darwin/openbsd/netbsd are with intent to preserve
the original struct implementation/backwards compatibility.
Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>
* fix: meminfo debug log unsupported value
Fixes:
```
ts=2024-06-11T19:04:55.591Z caller=meminfo.go:44 level=debug collector=meminfo msg="Set node_mem" memInfo="unsupported value type"
```
Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>
* fix: don't coerce nil Meminfo entries to 0, leave out if nil
Nil entries in procfs.Meminfo fields indicate that the value isn't
present on the system. Coercing those nil values to `0` introduces new
metrics on systems that should not be present and can break some
queries.
Addresses PR feedback:
https://github.com/prometheus/node_exporter/pull/3049#discussion_r1637581536https://github.com/prometheus/node_exporter/pull/3049#discussion_r1637584482
Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>
---------
Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>
* Add include and exclude filter for sensors in hwmon collector
Fixes#2242
This commit adds two new flags (`collector.hwmon.sensor-include` and `collector.hwmon.sensor-exclude`) to the `hwmon` collector to allow inclusion or exclusion of specific sensors.
Some devices export nonsensical values for certain sensors. Here is an example:
```
node_hwmon_temp_celsius{chip="platform_nct6775_656",sensor="temp13"} 49.75
node_hwmon_temp_celsius{chip="platform_nct6775_656",sensor="temp15"} 3.892313987e+06
node_hwmon_temp_celsius{chip="platform_nct6775_656",sensor="temp16"} 3.892313987e+06
```
As a user I would like to only exclude these sensors, not necessarily the complete device (as is currently possible with the `--collector.hwmon.chip-exclude` flag) as other sensor values might be sensical or desired.
The new option filters based both on device name and sensor name, separated by a semicolon. For example, to exclude the two sensors above, the following regex can be used:
~~~
--collector.hwmon.sensor-exclude="platform_nct6775_656;temp1[5,6]"
~~~
---------
Signed-off-by: Simon Krenger <skrenger@redhat.com>
Running "go test" in the collector directory, without the fixtures
available, results in multiple panics, including `SIGSEGV`. Most of
these are due to incorrect error handling. This cleans them up.
Signed-off-by: Benny Siegert <bsiegert@gmail.com>
Check that the PSI metrics are returned in order to avoid nil pointer
dereference.
* Update fixutre to match real-world samples.
Fixes: https://github.com/prometheus/node_exporter/issues/3015
Signed-off-by: Ben Kochie <superq@gmail.com>
Replace all cpu_ticks_* with cpu_nsec_*, since the former was off my a
magnitude of 10e6, and showed incorrect values for
node_cpu_seconds_total.
Fixes: #1837
Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>
* Update Go to 1.22.
* Update Go modules.
* Use new version collector.
* Use standard library slices package.
Signed-off-by: Ben Kochie <superq@gmail.com>
When the zfs collector fails on FreeBSD it doesn't log which `mib` triggered the issue. This makes diagnostics hard.
Incompatibilities in the list of supported mibs is not uncommon with major os updates. By adding this change, it'll be easier for users to report the specific mib that is triggering the failure.
Related to #2847
Signed-off-by: Daniel Kimsey <90741+dekimsey@users.noreply.github.com>
Avoid metrics with inconsistent help-texts. The earlier behaviour has
been preserved in the sense that the first encountered instance is still
used to generate metrics, whereas the subsequent inconsistent ones are
ignored along with a few peripheral changes.
```
# HELP node_scrape_collector_duration_seconds node_exporter: Duration of a collector scrape.
#TYPE node_scrape_collector_duration_seconds gauge
node_scrape_collector_duration_seconds{collector="textfile"} 0.0004005
# HELP node_scrape_collector_success node_exporter: Whether a collector succeeded.
# TYPE node_scrape_collector_success gauge
node_scrape_collector_success{collector="textfile"} 1
# HELP node_textfile_mtime_seconds Unixtime mtime of textfiles successfully read.
# TYPE node_textfile_mtime_seconds gauge
node_textfile_mtime_seconds{file="/Users/rexagod/repositories/misc/node_exporter/ne-bar.prom"} 1.710812009e+09
node_textfile_mtime_seconds{file="/Users/rexagod/repositories/misc/node_exporter/ne-foo.prom"} 1.710811982e+09
# HELP node_textfile_scrape_error 1 if there was an error opening or reading a file, 0 otherwise
# TYPE node_textfile_scrape_error gauge
node_textfile_scrape_error 1
# HELP promhttp_metric_handler_errors_total Total number of internal errors encountered by the promhttp metric handler.
# TYPE promhttp_metric_handler_errors_total counter
promhttp_metric_handler_errors_total{cause="encoding"} 0
promhttp_metric_handler_errors_total{cause="gathering"} 0
# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.
# TYPE promhttp_metric_handler_requests_in_flight gauge
promhttp_metric_handler_requests_in_flight 1
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 0
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0
# HELP tau_infrastructure_performing_maintenance_task At what timestamp a given task started or stopped, the last time it was run.
# TYPE tau_infrastructure_performing_maintenance_task gauge
tau_infrastructure_performing_maintenance_task{main_task="nightly",start_or_stop="start",sub_task="main"} 1.64728080198446e+09
```
Fixes: #2317
Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>
Apply the same metric name sanitization to the keys as to the metric
names. This avoids conflicting help strings in the metric registry.
Fixes: https://github.com/prometheus/node_exporter/issues/2893
Signed-off-by: Ben Kochie <superq@gmail.com>
Fix golangci-lint "ineffectual assignment" by correctly capturing any
errors within the hwmon gathering loop.
Signed-off-by: Ben Kochie <superq@gmail.com>
While the CPU vulnerabilities collector has been added in https://github.com/prometheus/node_exporter/pull/2721 , it's currently not including information regarding the mitigation strategy used for a given vulnerability.
This information can be quite valuable, as often times different mitigation strategies come with a different performance impact.
This commit adds a third label to the cpu_vulnerabilities_info metric, to include the "mitigation" used for a given vulnerability - if a given vulnerability is not affecting a node or the node is still vulnerable, the mitigation is expected to be empty.
Signed-off-by: João Lima <jlima@cloudflare.com>
Adds a count for TCP packets received out of orders. This can be an
indication that there is packet loss on the way packets travel towards
this server. In that case, the sender will retransmit (and we can
already monitor the Tcp_RetransSegs there), but we have no way to
monitor the packet loss on the receiver side. When a packet is received
and the receiver detects previous one missing, it will increase the
TCPOFOQueue counter and reply with selective ACK to the sender, both
possible indications of packet loss. Confirmation of packet loss can be
achieved by taking packet captures, ignoring wireshark analysis, and
carefully looking at data being retransmitted based on the TCP seq.
Just like RetransSegs, TCPOFOQueue should be interesting for any
deployment as a mean to detect packet loss, so here suggesting adding it
to the default list.
Signed-off-by: François Rigault <frigo@amadeus.com>
Co-authored-by: François Rigault <frigo@amadeus.com>
This attribute was introduced it v6.6-rc1.
The relevant changes in procfs were merged here:
https://github.com/prometheus/procfs/pull/574
and are part of procfs v0.11.2
I have also figured out that the stat should be part of the v4 ops
counters struct, but that will need changes to both procfs and this
code. Since people are already using 6.6-rc1, I think it's better to get
the code out there --- even if they don't care about wdeleg_getattr,
currently they get _no_ nfsd stats with 6.6-rc1.
I will make two follow-up PRs to clean this up in the next releases of
procfs and node-exporter.
Signed-off-by: Tobias Klausmann <klausman@schwarzvogel.de>