Commit graph

138 commits

Author SHA1 Message Date
liyang 505363a67d chore: add instance label in NodeHighNumberConntrackEntriesUsed alert description
Signed-off-by: liyang <ly18846162402@163.com>
2024-12-23 11:25:05 +01:00
Duologic 2fccdf4e17 fix(docs): add node(Warning|Critical)WindowHours to node-mixin
Signed-off-by: Duologic <jeroen@simplistic.be>
2024-12-23 11:24:50 +01:00
Tom d0c1d00d18
Migrate dashboards to new grafonnet library (#3147)
Migrated away from deprecated Grafonnet library. This replaces panels using Angular JS which are disabled by default in Grafana 11 and will be unsupported in Grafana 12.

Fixes #3046

---------

Signed-off-by: Tom <12222103+critchtionary@users.noreply.github.com>
2024-12-19 16:49:22 +01:00
Jan Breitkopf a38a5d7b48
alerts: exclude iowait from NodeCPUHighUsage alert (#3203)
Signed-off-by: Jan Breitkopf <jan.breitkopf@prorocketeers.com>
2024-12-17 14:11:26 +01:00
Johannes Ziemke 92c10f9fd1 Add AIX dashboard
Signed-off-by: Johannes Ziemke <github@5pi.de>
2024-09-28 15:58:02 +02:00
Stefan Andres fe71568130
Add UIDs to dashboards (#3042)
Automatically add a uid to each dashboard.
This prevents changing URLs when restarting a grafana pod and
re-importing the dashboards via ConfigMaps.

Signed-off-by: Stefan Andres <sandres@anaconda.com>
2024-07-14 14:22:52 +02:00
looklose 7d4103c089 chore: fix typo in comment
Signed-off-by: looklose <shishuaiqun@yeah.net>
2024-04-10 14:24:02 +02:00
Adrian Berger cc49133321
Add multi-cluster support for Nodes dashboard (#2945)
Signed-off-by: Adrian Berger <adria.berger94@gmail.com>
2024-03-08 14:41:36 +01:00
Taylor Sly 9f9473859b
Fix description for NodeDiskIOSaturation alert (#2929)
NodeDiskIOSaturation description should say 30m per the "for" clause

Signed-off-by: Taylor Sly <slyt@users.noreply.github.com>
2024-02-16 08:58:22 +01:00
Anton Lugovoi 81fc05c45f
Make filesystem space prediction window configurable (#2844)
Signed-off-by: fitz123 <alugovoi@ordercapital.com>
2023-11-13 02:10:56 +01:00
Ayoub NASR 7333465abf
Add NodeBondingDegraded alert (#2843)
Signed-off-by: Ayoub Nasr <ayoub.nasr@scality.com>
2023-11-13 00:36:30 +01:00
Vitaly Zhuravlev e8d7f4e8b3 Revert alerts pending durations
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
2023-06-29 23:26:52 +08:00
Vitaly 3e250a95a0 Update NodeSystemSaturation severity
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
2023-06-29 23:26:52 +08:00
Vitaly Zhuravlev b7dfb32bfc Set severity to NodeCPUHighUsage to info
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
2023-06-29 23:26:52 +08:00
Vitaly Zhuravlev 6bdc1d9c98 Add thresholds for memory, disk and system alerts
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
2023-06-29 23:26:52 +08:00
Vitaly Zhuravlev 77ae769179 Add thresholds for memory alerts
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
2023-06-29 23:26:52 +08:00
Vitaly Zhuravlev 2111e70ac7 Add comma after 'mounted on'
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
2023-06-29 23:26:52 +08:00
Vitaly Zhuravlev e48e7909f4 Extend alert description
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
2023-06-29 23:26:52 +08:00
Vitaly Zhuravlev da32f8de17 Decrease NodeSystemdServiceFailed severity to warning
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
2023-06-29 23:26:52 +08:00
Vitaly Zhuravlev 580c497261 Add NodeSystemSaturation and NodeMemoryMajorPagesFaults
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
2023-06-29 23:26:52 +08:00
Vitaly Zhuravlev e15e7d6a7b Fix NodeMemoryHighUtilization alert
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
2023-06-29 23:26:52 +08:00
Vitaly Zhuravlev c3ec6e8af1 Add diskDevice selector
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
2023-06-29 23:26:52 +08:00
Vitaly Zhuravlev 962de6c921 Add %(nodeExporterSelector)s to Network and conntrack alerts
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
2023-06-29 23:26:52 +08:00
Vitaly Zhuravlev 94fc82e418 Add NodeDiskIOSaturation alert
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
2023-06-29 23:26:52 +08:00
Vitaly Zhuravlev 614030bb80 Set 'at' everywhere as preposition for instance
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
2023-06-29 23:26:52 +08:00
Vitaly Zhuravlev 3d8075da7d Decrease NodeNetwork*Errs pending period
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
2023-06-29 23:26:51 +08:00
Vitaly Zhuravlev 74794182a7 Add failed systemd service alert
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
2023-06-29 23:26:51 +08:00
Vitaly Zhuravlev fd2d62af63 Add CPU and memory alerts
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
2023-06-29 23:26:51 +08:00
Vitaly Zhuravlev 0e0399d41e Decrease NodeFilesystem pending time to 15m
30m is too long, and there is a risk of running out of disk space/inodes completely if something is filling up the disk very fast (like a log file).

Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
2023-06-29 23:26:51 +08:00
Vitaly Zhuravlev fc967aa992 Add mountpoint to NodeFilesystem alerts
This helps to identify the alerting filesystem.

Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
2023-06-29 23:26:51 +08:00
Will Bollock 0a17e17718
docs (node/mixin): fix annotation for Skew alert (#2671)
This updates the annotation for the NodeClockSkewDetected mixin alert to
match the new threshold set.

Original discussion was in this PR: https://github.com/prometheus/node_exporter/pull/1480

I spent an embarrassingly large amount of time trying to figure out how
the heck that alert would mean 300s of clock skew. Turns out the
annotation was just left the same after the threshold change.

Signed-off-by: Will Bollock <wbollock@linode.com>
2023-05-11 10:33:10 +02:00
Ben Kochie c8705ec4b2
Deprecate ntp collector
The ntp collector has always been a source of confusion and problems.
The data it produces is more of a blackbox probe against an NTP server.
The time sync / offset data produced is not what users expect.

Mark this collector as deprecated to be removed in v2.0.0

Signed-off-by: Ben Kochie <superq@gmail.com>
2023-02-16 09:27:38 +01:00
Ryan J. Geyer 5e552bac02 Replace mistaken ) with }, resulting in parsable promql
Signed-off-by: Ryan J. Geyer <me@ryangeyer.com>
2022-12-13 13:30:42 +01:00
Jan Fajerski 87b8e3790d
docs/node-mixin: add fsMointpointSelector to alerts and dashboards (#2446)
* docs/node-mixin: add fsMountpointSelector

This adds the option to add a `mountpoint` selector to filesystem
related alerts. The default is `mountpoint!=""`.

* docs/node-mixins: add fsMountpointSelector to dashboards

Signed-off-by: Jan Fajerski <jfajersk@redhat.com>
2022-10-20 13:06:31 +02:00
Siavash Sefid Rodi f40dd31780 Fix CPU renaming rule
Signed-off-by: Florian Best <best@univention.de>
2022-07-27 13:16:00 +02:00
Vitaly Zhuravlev 7519830a8a Change io time units to %util
When applying rate() to seconds we get 'seconds per second', or fractions of a second, so the value can actually range from 0 to 1.

Also update intervalFactor to 1 for better rates.

Signed-off-by: Vitaly Zhuravlev <zhuravlev.vitaly@gmail.com>
2022-07-26 11:09:43 +02:00
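The reasoning in the commit above can be illustrated with a minimal Python sketch. It is not taken from the mixin: the function name `busy_fraction` and the sample values are hypothetical, and the real Prometheus rate() additionally handles counter resets and extrapolation, which this sketch ignores. It only shows that dividing the increase of a counter measured in seconds by the elapsed seconds yields a unitless fraction between 0 and 1, which can be displayed as a percentage.

```python
def busy_fraction(t0, v0, t1, v1):
    """Average per-second increase of a counter that is measured in seconds.

    (t0, v0) and (t1, v1) are (timestamp, counter value) samples. Both the
    timestamps and the counter are in seconds, so the result is
    'seconds per second': a unitless fraction.
    """
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    return (v1 - v0) / (t1 - t0)


# Hypothetical samples: the device was busy for 15 s of a 60 s window.
fraction = busy_fraction(0.0, 100.0, 60.0, 115.0)
percent_util = fraction * 100  # the same value expressed as a percentage
```

A device cannot be busy for more than one second per second of wall-clock time, which is why a percentage unit suits this value better than a time unit.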
Vitaly Zhuravlev 469600f4bf Update units of network and disk graphs
https://prometheus.io/docs/prometheus/latest/querying/functions/#rate

rate() calculates per-second average rate, therefore Bps units should be used for disks.

In networking bandwidth throughput is usually measured in bits/s so units are changed accordingly.

Signed-off-by: Vitaly Zhuravlev <zhuravlev.vitaly@gmail.com>
2022-07-26 11:09:43 +02:00
Albert Mikaelyan cee386678c fix compatibility rule to convert to old node_cpu metric
Signed-off-by: Albert Mikaelyan <tahvok@gmail.com>
2022-07-25 18:54:53 +02:00
Paweł Krupa (paulfantom) 8571536327 docs/node-mixin: add missing selectors
Signed-off-by: Paweł Krupa (paulfantom) <pawel@krupa.net.pl>
2022-07-19 16:44:16 +02:00
Sven Kieske d64766f43d
fix the following markdownlint issues (#2362)
fix the following markdownlint errors (and some more):

[..]mixins/node-exporter/README.md:13: MD031 Fenced code blocks should be surrounded by blank lines
[..]mixins/node-exporter/README.md:21: MD031 Fenced code blocks should be surrounded by blank lines
[..]mixins/node-exporter/README.md:27: MD031 Fenced code blocks should be surrounded by blank lines
[..]mixins/node-exporter/README.md:33: MD031 Fenced code blocks should be surrounded by blank lines
[..]mixins/node-exporter/README.md:41: MD034 Bare URL used
A detailed description of the rules is available at https://github.com/markdownlint/markdownlint/blob/master/docs/RULES.md

Signed-off-by: Sven Kieske <s.kieske@mittwald.de>
2022-06-28 05:50:06 +02:00
Björn Rabenstein e5128e83f2
Merge pull request #2364 from grafana/vzhuravlev/fs_table
mixin: Change disk graph to disk table
2022-06-08 20:46:47 +02:00
Jan Fajerski cec414df78 node-mixins/config: Switch fsAvailable warning and critical thresholds
Problem: In 0b50eb7294 the usage of the
threshold variables was adjusted. The values had been switched as well
resulting in reversed thresholds after the commit above. Warnings now
have a smaller threshold than critical alerts.

Solution: Adjust thresholds to reflect that warnings should be alerted
on before critical alerts.

Issues: https://github.com/prometheus/node_exporter/pull/2352

Signed-off-by: Jan Fajerski <jfajersk@redhat.com>
2022-06-07 12:10:48 +02:00
Björn Rabenstein b5a2ad46e3
Merge pull request #2351 from grafana/vzhuravlev/macos
Add darwin dashboard
2022-05-03 12:59:29 +02:00
Vitaly Zhuravlev eef827006a Change disk graph to disk table
Signed-off-by: Vitaly Zhuravlev <zhuravlev.vitaly@gmail.com>
2022-04-27 19:15:50 +04:00
Daniel Lenar 0b50eb7294 Reverse fsSpaceAvailableCriticalThreshold and fsSpaceAvailableWarningThreshold
Currently the critical alert for available space fires on the warning
threshold, and the warning alert fires on the critical threshold.

Signed-off-by: Daniel Lenar <dlenar@vailsys.com>
2022-04-21 11:34:54 -05:00
Gabriel Amaral Antunes 410e069471 Add darwin dashboard to mixin
Signed-off-by: Vitaly Zhuravlev <zhuravlev.vitaly@gmail.com>
2022-04-20 15:18:43 +04:00
Vitaly Zhuravlev 8823605f12 Fix NodeFileDescriptorLimit alerts
Signed-off-by: Vitaly Zhuravlev <zhuravlev.vitaly@gmail.com>
2022-04-07 16:25:17 +04:00
Severyn Lisovskyi 7b86b7cb29
[node-mixin] change current datasource to grafana's default
Signed-off-by: Severyn Lisovskyi <993215+sev3ryn@users.noreply.github.com>
2022-02-02 14:45:26 +01:00
Julian Wiedmann 3e6f4ce627
mixin: exclude iowait and steal from CPU Utilisation (#2194)
'iowait' and 'steal' indicate specific idle/wait states, which shouldn't
be counted into CPU Utilisation. Also see
https://github.com/prometheus-operator/kube-prometheus/pull/796 and
https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/667.

Per the iostat man page:

%idle
    Show the percentage of time that the CPU or CPUs were idle and the
    system did not have an outstanding disk I/O request.

%iowait
     Show the percentage of time that the CPU or CPUs were idle during
     which the system had an outstanding disk I/O request.

%steal
     Show the percentage of time spent in involuntary wait by the
     virtual CPU or CPUs while the hypervisor was servicing another
     virtual processor.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
2021-11-04 11:03:27 +01:00
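The idea in the commit above can be sketched in Python. This is a minimal illustration, not the mixin's actual expression (which is not shown here): the function name `cpu_utilisation`, the mode names in the sample, and the numbers are assumptions. It treats `idle`, `iowait` and `steal` as non-busy states and reports the remaining modes as a fraction of total CPU time.

```python
# Modes that, per the commit message, should not count towards utilisation.
NON_BUSY_MODES = {"idle", "iowait", "steal"}


def cpu_utilisation(seconds_by_mode):
    """Fraction of CPU time spent in busy modes.

    seconds_by_mode maps a CPU mode name to the seconds spent in that mode
    over some window (hypothetical input shape, chosen for illustration).
    """
    total = sum(seconds_by_mode.values())
    if total == 0:
        return 0.0
    busy = sum(
        seconds
        for mode, seconds in seconds_by_mode.items()
        if mode not in NON_BUSY_MODES
    )
    return busy / total


# Hypothetical 100 s window: only user and system time count as busy.
sample = {"user": 30.0, "system": 10.0, "idle": 50.0, "iowait": 6.0, "steal": 4.0}
utilisation = cpu_utilisation(sample)
```

If only `idle` were excluded, the same sample would report half of the window as busy, even though 10 of those seconds were spent waiting rather than doing work.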
Ben Kochie 421fc429f3
Replace deprecated linter (#2176)
Upstream is replacing `golint` with `revive`.
* Cleanup unused mixin go files.

Signed-off-by: Ben Kochie <superq@gmail.com>
2021-10-27 11:01:15 +02:00