Commit graph

105 commits

Author SHA1 Message Date
Vitaly Zhuravlev 2111e70ac7 Add comma after 'mounted on'
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
2023-06-29 23:26:52 +08:00
Vitaly Zhuravlev e48e7909f4 Extend alert description
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
2023-06-29 23:26:52 +08:00
Vitaly Zhuravlev da32f8de17 Decrease NodeSystemdServiceFailed severity to warning
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
2023-06-29 23:26:52 +08:00
Vitaly Zhuravlev 580c497261 Add NodeSystemSaturation and NodeMemoryMajorPagesFaults
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
2023-06-29 23:26:52 +08:00
Vitaly Zhuravlev e15e7d6a7b Fix NodeMemoryHighUtilization alert
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
2023-06-29 23:26:52 +08:00
Vitaly Zhuravlev c3ec6e8af1 Add diskDevice selector
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
2023-06-29 23:26:52 +08:00
Vitaly Zhuravlev 962de6c921 Add %(nodeExporterSelector)s to Network and conntrack alerts
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
2023-06-29 23:26:52 +08:00
Vitaly Zhuravlev 94fc82e418 Add NodeDiskIOSaturation alert
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
2023-06-29 23:26:52 +08:00
Vitaly Zhuravlev 614030bb80 Set 'at' everywhere as preposition for instance
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
2023-06-29 23:26:52 +08:00
Vitaly Zhuravlev 3d8075da7d Decrease NodeNetwork*Errs pending period
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
2023-06-29 23:26:51 +08:00
Vitaly Zhuravlev 74794182a7 Add failed systemd service alert
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
2023-06-29 23:26:51 +08:00
Vitaly Zhuravlev fd2d62af63 Add CPU and memory alerts
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
2023-06-29 23:26:51 +08:00
Vitaly Zhuravlev 0e0399d41e Decrease NodeFilesystem pending time to 15m
30m is too long: there is a risk of running out of disk space or inodes completely if something (such as a log file) is filling up the disk very fast.

Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
2023-06-29 23:26:51 +08:00
Vitaly Zhuravlev fc967aa992 Add mountpoint to NodeFilesystem alerts
This helps to identify the alerting filesystem.

Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
2023-06-29 23:26:51 +08:00
Will Bollock 0a17e17718
docs (node/mixin): fix annotation for Skew alert (#2671)
This updates the annotation for the NodeClockSkewDetected mixin alert to
match the new threshold set.

Original discussion was in this PR: https://github.com/prometheus/node_exporter/pull/1480

I spent an embarrassingly large amount of time trying to figure out how
the heck that alert would mean 300s of clock skew. Turns out the
annotation was just left the same after the threshold change.

Signed-off-by: Will Bollock <wbollock@linode.com>
2023-05-11 10:33:10 +02:00
Ryan J. Geyer 5e552bac02 Replace mistaken ) with }, resulting in parsable promql
Signed-off-by: Ryan J. Geyer <me@ryangeyer.com>
2022-12-13 13:30:42 +01:00
Jan Fajerski 87b8e3790d
docs/node-mixin: add fsMountpointSelector to alerts and dashboards (#2446)
* docs/node-mixin: add fsMountpointSelector

This adds the option to add a `mountpoint` selector to filesystem
related alerts. The default is `mountpoint!=""`.

* docs/node-mixins: add fsMountpointSelector to dashboards

Signed-off-by: Jan Fajerski <jfajersk@redhat.com>
2022-10-20 13:06:31 +02:00
Vitaly Zhuravlev 7519830a8a Change io time units to %util
When applying rate() to seconds we get 'seconds per second', or fractions of a second, so the value can actually range from 0 to 1.

Also update intervalFactor to 1 for better rates.

Signed-off-by: Vitaly Zhuravlev <zhuravlev.vitaly@gmail.com>
2022-07-26 11:09:43 +02:00
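The reasoning in 7519830a8a can be illustrated with a minimal sketch. The function name and the sample values are hypothetical and are not taken from the mixin; the sketch only shows why a per-second rate over a "seconds spent" counter yields a fraction between 0 and 1 that can be displayed as a percentage.

```python
def io_utilisation(prev_seconds, curr_seconds, elapsed_seconds):
    """Per-second rate of a 'seconds spent doing I/O' counter.

    The result is in 'seconds per second', so for a single device it is
    a fraction between 0 and 1.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (curr_seconds - prev_seconds) / elapsed_seconds


# Hypothetical samples: the counter advanced by 15 s over a 60 s window.
fraction = io_utilisation(100.0, 115.0, 60.0)
print(f"{fraction:.2f} s/s = {fraction * 100:.0f}% util")
```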
Vitaly Zhuravlev 469600f4bf Update units of network and disk graphs
https://prometheus.io/docs/prometheus/latest/querying/functions/#rate

rate() calculates per-second average rate, therefore Bps units should be used for disks.

In networking, bandwidth throughput is usually measured in bits/s, so units are changed accordingly.

Signed-off-by: Vitaly Zhuravlev <zhuravlev.vitaly@gmail.com>
2022-07-26 11:09:43 +02:00
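The unit choice described in 469600f4bf can be sketched as follows. Both helper names and the sample numbers are hypothetical; the only assumption beyond the commit message is that the underlying counters are in bytes, so the network figure is multiplied by 8 to get bits/s.

```python
def bytes_per_second(prev_bytes, curr_bytes, elapsed_seconds):
    """Average per-second rate of a byte counter (suitable for disk graphs)."""
    return (curr_bytes - prev_bytes) / elapsed_seconds


def to_bits_per_second(bytes_per_sec):
    """Convert a byte rate to a bit rate (suitable for network graphs)."""
    return bytes_per_sec * 8


# Hypothetical samples: 6000 bytes transferred over 60 s.
rate = bytes_per_second(0, 6000, 60)
print(rate, "B/s")
print(to_bits_per_second(rate), "bit/s")
```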
Paweł Krupa (paulfantom) 8571536327 docs/node-mixin: add missing selectors
Signed-off-by: Paweł Krupa (paulfantom) <pawel@krupa.net.pl>
2022-07-19 16:44:16 +02:00
Sven Kieske d64766f43d
fix the following markdownlint issues (#2362)
fix the following markdownlint errors (and some more):

[..]mixins/node-exporter/README.md:13: MD031 Fenced code blocks should be surrounded by blank lines
[..]mixins/node-exporter/README.md:21: MD031 Fenced code blocks should be surrounded by blank lines
[..]mixins/node-exporter/README.md:27: MD031 Fenced code blocks should be surrounded by blank lines
[..]mixins/node-exporter/README.md:33: MD031 Fenced code blocks should be surrounded by blank lines
[..]mixins/node-exporter/README.md:41: MD034 Bare URL used
A detailed description of the rules is available at https://github.com/markdownlint/markdownlint/blob/master/docs/RULES.md

Signed-off-by: Sven Kieske <s.kieske@mittwald.de>
2022-06-28 05:50:06 +02:00
Björn Rabenstein e5128e83f2
Merge pull request #2364 from grafana/vzhuravlev/fs_table
mixin: Change disk graph to disk table
2022-06-08 20:46:47 +02:00
Jan Fajerski cec414df78 node-mixins/config: Switch fsAvailable warning and critical thresholds
Problem: In 0b50eb7294 the usage of the
threshold variables was adjusted. The values had been switched as well
resulting in reversed thresholds after the commit above. Warnings now
have a smaller threshold than critical alerts.

Solution: Adjust thresholds to reflect that warnings should be alerted
on before critical alerts.

Issues: https://github.com/prometheus/node_exporter/pull/2352

Signed-off-by: Jan Fajerski <jfajersk@redhat.com>
2022-06-07 12:10:48 +02:00
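The ordering that cec414df78 restores can be sketched as below. This is an illustration only: the function and the values 5 and 3 are hypothetical, and it assumes the thresholds are percentages of available space, so the warning threshold has to be the larger of the two for the warning to fire first.

```python
def severity(available_percent, warning_threshold, critical_threshold):
    """Return the alert severity for a given percentage of available space.

    Because the thresholds apply to *available* space, the warning
    threshold must be higher than the critical one; otherwise the
    critical alert would fire before the warning.
    """
    if warning_threshold <= critical_threshold:
        raise ValueError("warning threshold must be above critical threshold")
    if available_percent < critical_threshold:
        return "critical"
    if available_percent < warning_threshold:
        return "warning"
    return None


# Hypothetical thresholds: warn below 5% available, critical below 3%.
for available in (10, 4, 2):
    print(available, severity(available, 5, 3))
```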
Björn Rabenstein b5a2ad46e3
Merge pull request #2351 from grafana/vzhuravlev/macos
Add darwin dashboard
2022-05-03 12:59:29 +02:00
Vitaly Zhuravlev eef827006a Change disk graph to disk table
Signed-off-by: Vitaly Zhuravlev <zhuravlev.vitaly@gmail.com>
2022-04-27 19:15:50 +04:00
Daniel Lenar 0b50eb7294 Reverse fsSpaceAvailableCriticalThreshold and fsSpaceAvailableWarningThreshold
Currently, the critical alert for space available fires at the warning
threshold, and the warning alert fires at the critical threshold.

Signed-off-by: Daniel Lenar <dlenar@vailsys.com>
2022-04-21 11:34:54 -05:00
Gabriel Amaral Antunes 410e069471 Add darwin dashboard to mixin
Signed-off-by: Vitaly Zhuravlev <zhuravlev.vitaly@gmail.com>
2022-04-20 15:18:43 +04:00
Vitaly Zhuravlev 8823605f12 Fix NodeFileDescriptorLimit alerts
Signed-off-by: Vitaly Zhuravlev <zhuravlev.vitaly@gmail.com>
2022-04-07 16:25:17 +04:00
Severyn Lisovskyi 7b86b7cb29
[node-mixin] change current datasource to grafana's default
Signed-off-by: Severyn Lisovskyi <993215+sev3ryn@users.noreply.github.com>
2022-02-02 14:45:26 +01:00
Julian Wiedmann 3e6f4ce627
mixin: exclude iowait and steal from CPU Utilisation (#2194)
'iowait' and 'steal' indicate specific idle/wait states, which shouldn't
be counted into CPU Utilisation. Also see
https://github.com/prometheus-operator/kube-prometheus/pull/796 and
https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/667.

Per the iostat man page:

%idle
    Show the percentage of time that the CPU or CPUs were idle and the
    system did not have an outstanding disk I/O request.

%iowait
     Show the percentage of time that the CPU or CPUs were idle during
     which the system had an outstanding disk I/O request.

%steal
     Show the percentage of time spent in involuntary wait by the
     virtual CPU or CPUs while the hypervisor was servicing another
     virtual processor.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
2021-11-04 11:03:27 +01:00
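The idea behind 3e6f4ce627 can be illustrated with a minimal sketch. The function, the mode names as dictionary keys, and the sample numbers are assumptions made for illustration and are not the mixin's actual query; the sketch only shows 'iowait' and 'steal' being treated like 'idle' rather than as busy time.

```python
# Modes that, per the commit message, should not count as utilisation.
NON_BUSY_MODES = {"idle", "iowait", "steal"}


def cpu_utilisation(seconds_by_mode):
    """Fraction of CPU time spent in busy modes.

    seconds_by_mode maps a CPU mode name to the seconds spent in that
    mode over some interval.
    """
    total = sum(seconds_by_mode.values())
    if total <= 0:
        raise ValueError("no CPU time recorded")
    non_busy = sum(
        seconds for mode, seconds in seconds_by_mode.items()
        if mode in NON_BUSY_MODES
    )
    return 1 - non_busy / total


# Hypothetical interval of 100 CPU-seconds.
sample = {"user": 30, "system": 10, "idle": 50, "iowait": 6, "steal": 4}
print(f"{cpu_utilisation(sample):.0%}")
```

With these sample values, counting only 'idle' as non-busy would report 50% utilisation, whereas excluding 'iowait' and 'steal' as well reports 40%.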
Ben Kochie 421fc429f3
Replace deprecated linter (#2176)
Upstream is replacing `golint` with `revive`.
* Cleanup unused mixin go files.

Signed-off-by: Ben Kochie <superq@gmail.com>
2021-10-27 11:01:15 +02:00
ngc104 4bc1c02000
fix bug in #2130 (#2170)
Signed-off-by: Yves Mettier <yves.mettier@orange.com>

Co-authored-by: Yves Mettier <yves.mettier@orange.com>
2021-10-21 12:07:38 +02:00
Tom Wilkie 9bc184d236
Datasource template variable should be labelled 'Data Source'
Signed-off-by: Tom Wilkie <tom@grafana.com>
2021-10-20 17:10:14 +01:00
Ben Kochie 5a38949451
Fix up mixin tests (#2167)
Use new Go install format, cleanup working dir setup.

Signed-off-by: Ben Kochie <superq@gmail.com>
2021-10-14 11:06:01 +02:00
Julien Pivotto 68a6c78c0d
Update go to 1.17 (#2159)
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2021-10-03 13:35:24 +02:00
Ben Kochie aeef1edd62
mixin: Add fallback for MemAvailable (#2130)
Add a fallback to Buffers+Cached+MemFree+Slab for older Linux kernels
where the MemAvailable metric is not available for memory utilization.

Signed-off-by: Ben Kochie <superq@gmail.com>
2021-09-28 10:22:06 +02:00
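The fallback described in aeef1edd62 can be sketched as follows. The function names and the dictionary layout are hypothetical; the keys simply reuse the field names from the commit message, plus MemTotal, which is assumed here as the denominator for utilisation.

```python
def available_bytes(meminfo):
    """Return MemAvailable, or an approximation on older kernels.

    If MemAvailable is missing, fall back to
    Buffers + Cached + MemFree + Slab, as the commit message describes.
    """
    if "MemAvailable" in meminfo:
        return meminfo["MemAvailable"]
    return (
        meminfo["Buffers"]
        + meminfo["Cached"]
        + meminfo["MemFree"]
        + meminfo["Slab"]
    )


def memory_utilisation(meminfo):
    """Fraction of total memory that is not available."""
    return 1 - available_bytes(meminfo) / meminfo["MemTotal"]


# Hypothetical values for a kernel that does not report MemAvailable.
old_kernel = {"MemTotal": 1000, "Buffers": 100, "Cached": 200,
              "MemFree": 150, "Slab": 50}
print(memory_utilisation(old_kernel))
```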
Johannes 'fish' Ziemke 6f1286b314 mixin: Drop mode label for num cpu metric
Signed-off-by: Johannes 'fish' Ziemke <github@freigeist.org>
2021-09-03 12:13:35 +02:00
Johannes 'fish' Ziemke fa9926c4eb mixin: Cheaper calculation for instance:node_num_cpu:sum
Signed-off-by: Johannes 'fish' Ziemke <github@freigeist.org>
2021-09-03 11:34:25 +02:00
paulfantom 832909dd25 docs/node-mixin/alerts: make NodeFilesystemAlmostOutOfSpace fire earlier
Signed-off-by: paulfantom <pawel@krupa.net.pl>
2021-08-16 16:35:58 +02:00
Johannes 'fish' Ziemke 7fc5c6045a Read config from $
Signed-off-by: Johannes 'fish' Ziemke <github@freigeist.org>
2021-07-27 16:32:05 +02:00
ArthurSens 3731f93fd7 Refactor USE method mixin dashboards with grafonnet-lib, add multi-cluster support.
Aiming for cleaner code and following standards used on younger mixins.

Signed-off-by: ArthurSens <arthursens2005@gmail.com>
2021-07-27 16:32:05 +02:00
Frederic Hemberger 5bee84f30d docs: Replace go get with go install for command installation
`go get` is deprecated for installation of commands as of go v1.17
Ref: https://go.googlesource.com/go/+/ced0fdbad0655d63d535390b1a7126fd1fef8348

Signed-off-by: Frederic Hemberger <mail@frederic-hemberger.de>
2021-07-20 12:16:46 +02:00
Loïc Blot 55ffe57cbc
feat(rules): add NodeFileDescriptorLimit kernel exhaustion alert
Add a new alert when fs.file-nr is close to fs.file-max

Signed-off-by: Loic Blot <loic.blot@unix-experience.fr>
2021-04-30 12:40:09 +02:00
raviprasad_lr 504f9b785c fix interval in graphs panels of node dashboard
Signed-off-by: raviprasad_lr <raviprasad_lr@yahoo.com>
2021-04-26 11:14:30 +02:00
Johannes 'fish' Ziemke a5908bf82b Make interval configurable
Signed-off-by: Johannes 'fish' Ziemke <github@freigeist.org>
2021-04-07 09:37:04 +02:00
Johannes 'fish' Ziemke 772335caa8 Use 5m rate in mixins
The default scrape interval of Prometheus is 60s, so we can't use a 1m
rate.

Signed-off-by: Johannes 'fish' Ziemke <github@freigeist.org>
2021-04-07 09:37:04 +02:00
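A rough sketch of the reasoning in 772335caa8: a rate needs at least two samples inside its window. The function below is hypothetical and deliberately simplified (scrapes land exactly on multiples of the interval, and the window is treated as open on the left); it only counts how many samples a window would contain.

```python
def samples_in_window(scrape_interval, window, eval_time):
    """Count scrape timestamps t with eval_time - window < t <= eval_time.

    Scrapes are assumed to happen at 0, scrape_interval,
    2 * scrape_interval, and so on (all values in seconds).
    """
    count = 0
    t = 0
    while t <= eval_time:
        if t > eval_time - window:
            count += 1
        t += scrape_interval
    return count


# With a 60 s scrape interval, a 1m window holds a single sample,
# which is not enough to compute a rate; a 5m window holds several.
print(samples_in_window(60, 60, 130))
print(samples_in_window(60, 300, 330))
```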
Ben Kochie eefb18db02
Merge pull request #1764 from dhoppe/patch-1
Use description instead of message as field for annotations
2021-01-24 14:56:03 +01:00
Ben Kochie 4b68aeb80a
Merge pull request #1862 from fsschmitt/fix/alerts-label-naming
fix: node_md_disks state label from fail to failed
2021-01-24 14:53:22 +01:00
Julien Pivotto f645d49242 Mixin: Bump jsonnet requirement to 0.16 to use go-jsonnetcmd
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2020-10-27 11:41:46 +01:00
Matthias Loibl 77e76485c0
Use absolute jsonnet import paths
This should be the way forward when importing libraries in jsonnet. It's
closer to how Go imports look and makes it more obvious where packages
live.

This is not breaking anything, as the old imports were already symlinks
to the now directly used directories.

Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>
2020-10-20 11:34:43 +02:00