134 commits

Author SHA1 Message Date
Michal 186e2e79c8
add yamllint config, fix yamllint errors (#2088)
After a recent change in prometheus/prometheus, Makefile.common now
includes a yamllint target, which currently fails. This PR adds the missing
yamllint config and fixes the yamllint errors.

Signed-off-by: Michal Wasilewski <mwasilewski@gmx.com>
2021-09-29 20:12:14 +02:00
Ben Kochie aeef1edd62
mixin: Add fallback for MemAvailable (#2130)
Add a fallback to Buffers+Cached+MemFree+Slab for older Linux kernels
where the MemAvailable metric is not available for memory utilization.

Signed-off-by: Ben Kochie <superq@gmail.com>
2021-09-28 10:22:06 +02:00
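
The fallback described in the commit above can be sketched roughly as follows. This is only an illustration: the mixin expresses the calculation in PromQL, and the function names and the dictionary of parsed `/proc/meminfo`-style values are assumptions made for this example.

```python
def available_bytes(meminfo):
    """Return MemAvailable if the kernel reports it, else the fallback sum."""
    if "MemAvailable" in meminfo:
        return meminfo["MemAvailable"]
    # Fallback for older kernels: Buffers + Cached + MemFree + Slab.
    return sum(meminfo[key] for key in ("Buffers", "Cached", "MemFree", "Slab"))


def memory_utilization(meminfo):
    """Fraction of total memory that is not available (0.0 to 1.0)."""
    return 1 - available_bytes(meminfo) / meminfo["MemTotal"]


# Hypothetical values (in bytes) for a kernel without MemAvailable.
old_kernel = {"MemTotal": 1000, "Buffers": 50, "Cached": 150, "MemFree": 100, "Slab": 100}
print(available_bytes(old_kernel))
```
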
Johannes 'fish' Ziemke 6f1286b314 mixin: Drop mode label for num cpu metric
Signed-off-by: Johannes 'fish' Ziemke <github@freigeist.org>
2021-09-03 12:13:35 +02:00
Johannes 'fish' Ziemke fa9926c4eb mixin: Cheaper calculation for instance:node_num_cpu:sum
Signed-off-by: Johannes 'fish' Ziemke <github@freigeist.org>
2021-09-03 11:34:25 +02:00
paulfantom 832909dd25 docs/node-mixin/alerts: make NodeFilesystemAlmostOutOfSpace fire earlier
Signed-off-by: paulfantom <pawel@krupa.net.pl>
2021-08-16 16:35:58 +02:00
Johannes 'fish' Ziemke 7fc5c6045a Read config from $
Signed-off-by: Johannes 'fish' Ziemke <github@freigeist.org>
2021-07-27 16:32:05 +02:00
ArthurSens 3731f93fd7 Refactor USE method mixin dashboards with grafonnet-lib, add multi-cluster support.
Aiming for cleaner code and following standards used on younger mixins.

Signed-off-by: ArthurSens <arthursens2005@gmail.com>
2021-07-27 16:32:05 +02:00
Frederic Hemberger 5bee84f30d docs: Replace go get with go install for command installation
`go get` is deprecated for installation of commands as of go v1.17
Ref: https://go.googlesource.com/go/+/ced0fdbad0655d63d535390b1a7126fd1fef8348

Signed-off-by: Frederic Hemberger <mail@frederic-hemberger.de>
2021-07-20 12:16:46 +02:00
Loïc Blot 55ffe57cbc
feat(rules): add NodeFileDescriptorLimit kernel exhaustion alert
Add a new alert when fs.file-nr is close to fs.file-max

Signed-off-by: Loic Blot <loic.blot@unix-experience.fr>
2021-04-30 12:40:09 +02:00
raviprasad_lr 504f9b785c fix interval in graphs panels of node dashboard
Signed-off-by: raviprasad_lr <raviprasad_lr@yahoo.com>
2021-04-26 11:14:30 +02:00
Johannes 'fish' Ziemke a5908bf82b Make interval configurable
Signed-off-by: Johannes 'fish' Ziemke <github@freigeist.org>
2021-04-07 09:37:04 +02:00
Johannes 'fish' Ziemke 772335caa8 Use 5m rate in mixins
The default scrape interval of Prometheus is 60s, so we can't use a 1m
rate.

Signed-off-by: Johannes 'fish' Ziemke <github@freigeist.org>
2021-04-07 09:37:04 +02:00
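
To illustrate the reasoning in the commit above: a rate needs at least two samples inside its range window. The sketch below counts how many scrapes fall into a window for a given scrape interval. The function name, the timestamps, and the closed-interval boundary handling are assumptions for illustration, not Prometheus internals.

```python
def samples_in_window(eval_time, window, scrape_interval, first_scrape=0.0):
    """Count scrape timestamps t with eval_time - window <= t <= eval_time."""
    count = 0
    t = first_scrape
    while t <= eval_time:
        if t >= eval_time - window:
            count += 1
        t += scrape_interval
    return count


# Scrapes every 60s, starting at t=7s (arbitrary offset), evaluated at t=1000s.
one_minute = samples_in_window(1000, 60, 60, first_scrape=7)
five_minutes = samples_in_window(1000, 300, 60, first_scrape=7)
print(one_minute, five_minutes)
```

With a 60s scrape interval, the 1m window in this example holds a single sample, which is not enough to compute a rate, while the 5m window holds several.
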
Ben Kochie eefb18db02
Merge pull request #1764 from dhoppe/patch-1
Use description instead of message as field for annotations
2021-01-24 14:56:03 +01:00
Ben Kochie 4b68aeb80a
Merge pull request #1862 from fsschmitt/fix/alerts-label-naming
fix: node_md_disks state label from fail to failed
2021-01-24 14:53:22 +01:00
Anthony D'Atri 8b466360a3
Modest doc improvements (#1876)
* Modest doc improvements

Signed-off-by: Anthony D'Atri <anthony.datri@gmail.com>
2020-11-25 16:46:58 +01:00
Julien Pivotto f645d49242 Mixin: Bump jsonnet requirement to 0.16 to use go-jsonnetcmd
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2020-10-27 11:41:46 +01:00
Matthias Loibl 77e76485c0
Use absolute jsonnet import paths
This should be the way forward when importing libraries in jsonnet. It's
closer to how Go imports look and makes it more obvious where packages
live.

This is not breaking anything, as the old imports were already symlinks
to the now directly used directories.

Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>
2020-10-20 11:34:43 +02:00
Björn Rabenstein 9c9c636305
Merge pull request #1861 from paulfantom/network-alerts
docs/node-mixin/alerts: use ratio for network alerts
2020-10-19 12:14:24 +02:00
paulfantom f81747e608 docs/node-mixin/alerts: add max error condition to alert about desynchronized clock
Signed-off-by: paulfantom <pawel@krupa.net.pl>
2020-10-08 11:15:16 +02:00
fsschmitt effa4da989 fix: node_md_disks state label as failed
Signed-off-by: fsschmitt <492108+fsschmitt@users.noreply.github.com>
2020-10-07 14:20:56 +01:00
paulfantom d7cbe85d22
docs/node-mixin/alerts: use a rate for network alerts
Signed-off-by: paulfantom <pawel@krupa.net.pl>
2020-10-07 13:04:51 +02:00
Arthur Outhenin-Chalandre 6585e43eec Fix memory gauge in mixin with multiple pods
Signed-off-by: Arthur Outhenin-Chalandre <arthur@cri.epita.fr>
2020-09-23 15:36:43 +02:00
Nicolas Lamirault ff2ff3410f
Configure 2 thresholds for NodeFilesystemAlmostOutOfSpace alert (#1835)
* Add: configure 2 thresholds for NodeFilesystemAlmostOutOfSpace alert

Signed-off-by: Nicolas Lamirault <nicolas.lamirault@gmail.com>
2020-09-18 11:28:32 +02:00
Rajat Vig 7dd8adf7ed
Fix NodeRAIDDegraded to not use a string rule expression
Signed-off-by: Rajat Vig <rvig@etsy.com>
2020-08-28 10:43:39 +01:00
Simon Pasquier 02212dd2c6 Run jsonnetfmt
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2020-08-25 10:15:30 +02:00
Hao Ke 9b7a0d06a1 Fix syntax error
Signed-off-by: Hao Ke <hao.ke@auryc.com>

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2020-08-25 10:07:37 +02:00
Simon Pasquier 6d959e2e8c *: add mixin tests to CI
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2020-08-25 10:03:46 +02:00
paulfantom e4ec8e04c5 docs/node-mixin: add alerts about failing RAID array
Signed-off-by: paulfantom <pawel@krupa.net.pl>
2020-08-24 16:17:20 +02:00
Dennis Hoppe fc64b70386
Use description instead of message as field for annotations
Signed-off-by: Dennis Hoppe <github@debian-solutions.de>
2020-06-24 13:38:57 +02:00
Frederic Branczyk b42819b69d
Merge pull request #1657 from povilasv/NodeTextFileCollectorScrapeError
Add NodeTextFileCollectorScrapeError alert to mixin
2020-04-30 17:54:06 +02:00
jangdm d4d2e1db98
fix typo in TIME.md (#1670)
fix typo in TIME.md

Signed-off-by: jangdm <jamin4@naver.com>
2020-04-09 09:00:00 +02:00
WOO CHANG HO 612ea0cd12 Add more compatible rules
Signed-off-by: zodiac12k <zodiac12k@gmail.com>
2020-04-08 10:19:44 +02:00
Povilas Versockas bd3e6d224c
Add NodeTextFileCollectorScrapeError alert to mixin
Signed-off-by: Povilas Versockas <p.versockas@gmail.com>
2020-03-31 18:12:36 +03:00
beorn7 8b00b22904 Fix sign error in NodeClockSkewDetected
Signed-off-by: beorn7 <beorn@grafana.com>
2020-03-25 13:07:23 +01:00
paulfantom 820f8d595e
docs/node-mixin: alert on desynchronised clock
Signed-off-by: paulfantom <pawel@krupa.net.pl>
2020-03-23 08:23:58 +01:00
Neraud 1006a2c4bb Add missing comma
Signed-off-by: Neraud <neraud.login@gmail.com>
2020-03-21 13:06:43 +01:00
Povilas Versockas 48bb6f670c Add NodeHighNumberConntrackEntriesUsed
Signed-off-by: Povilas Versockas <p.versockas@gmail.com>
2020-03-20 17:46:05 +01:00
iuri aranda 0107bc7942
Make FS space alerts thresholds configurable (#1624)
* Make FS space alerts thresholds configurable (#1)

This makes it possible to tweak the thresholds for
the NodeFilesystemSpaceFillingUp alerts, which
might be necessary in systems like Kubernetes,
where the image garbage collector runs at 85%,
so it's not a problem if disk usage reaches that level.

Signed-off-by: iuri aranda <iuri@skyscrapers.eu>
2020-03-02 16:24:51 +01:00
paulfantom 40570924b1
docs/node-mixin/dashboards: do not mix tabs and spaces
Signed-off-by: paulfantom <pawel@krupa.net.pl>
2019-11-01 15:46:21 +01:00
beorn7 c6914477f5 Fix the normalization for the cluster-wide dashboards
We actually have to count or sum, respectively, _all_ the selected
metrics for the cluster-wide view. Which means it's easiest to use the
`scalar` approach after all (but only in the cluster dashboard). This
still propagates all the labels.

I have extended the comment for the `nodeExporterSelector` to note
that the cluster dashboard only makes sense if all the selected node
exporters actually belong to the same cluster.

Since this is jsonnet, users can easily disable the cluster
dashboard. Or even create multiple instances of the dashboards with
different `nodeExporterSelector`s for different clusters.

Signed-off-by: beorn7 <beorn@grafana.com>
2019-10-30 22:52:36 +01:00
Benoît Knecht 5a7b85876d docs/node-mixin: Improve memory pressure rule
The `instance:node_memory_swap_io_pages:rate1m` rule was intended to
measure the amount of memory pressure a system is under, but its name is
a bit misleading (it specifically refers to swap), and the rate of
`node_vmstat_pgmajfault` is a better metric for memory pressure
(see #1524).

This commit renames `instance:node_memory_swap_io_pages:rate1m` to
`instance:node_vmstat_pgmajfault:rate1m`, and defines it as
`rate(node_vmstat_pgmajfault{%(nodeExporterSelector)s}[1m])`. The
dashboards are updated accordingly.

Signed-off-by: Benoît Knecht <benoit.knecht@fsfe.org>
2019-10-28 15:12:42 +01:00
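
The `%(nodeExporterSelector)s` placeholder in the rule above uses named, Python-style string formatting, which Jsonnet also supports. As a rough illustration, the sketch below expands the expression in Python. The selector value `job="node"` and the `render` helper are assumptions for this example and are not taken from the mixin's configuration.

```python
RULE_TEMPLATE = "rate(node_vmstat_pgmajfault{%(nodeExporterSelector)s}[1m])"


def render(template, config):
    """Substitute named %(key)s placeholders with values from config."""
    return template % config


# Assumed selector value, for illustration only.
config = {"nodeExporterSelector": 'job="node"'}
print(render(RULE_TEMPLATE, config))
```
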
Scott Brenner 813a4bdf8b Two quick typo fixes
Signed-off-by: Scott Brenner <scott@scottbrenner.me>
2019-10-09 20:42:27 -07:00
Björn Rabenstein 855a1f1d18
Merge pull request #1482 from leojonathanoh/fix-node-mixin-prometheus-alert-rules-to-use-percentage
Fix node-mixin prometheus alert rules to use percentage
2019-09-26 20:01:18 +02:00
Sergiusz Urbaniak f4417b209a node-mixin: fix configuration for unset fsSelector/diskDeviceSelector
As per https://github.com/prometheus/node_exporter/pull/1429#discussion_r304210103
we want to fetch all devices and all fs types.

Currently, this is done by setting empty string which breaks most queries which rely on it.

This fixes it by setting the appropriate selector instead of empty string.

Signed-off-by: Sergiusz Urbaniak <sergiusz.urbaniak@gmail.com>
2019-09-12 14:02:56 +02:00
Sergiusz Urbaniak ed78237036 node-mixin: fix query in Disk Space Utilisation dashboard
Signed-off-by: Sergiusz Urbaniak <sergiusz.urbaniak@gmail.com>
2019-09-12 14:02:56 +02:00
Leo dfeec07f2f Fix node-mixin prometheus alert rules to use percentage
Signed-off-by: Leo <leonardjonathanoh@live.com>
2019-09-11 08:47:24 +00:00
Björn Rabenstein ab8cf1f718 Node mixin: Clarify dashboard dependency on rules (#1475)
Following @discordianfish's suggestion
[here](https://github.com/prometheus/node_exporter/issues/1454#issuecomment-524225222).

Signed-off-by: beorn7 <beorn@grafana.com>
2019-09-08 10:55:43 +02:00
beorn7 76ff263ca6 Update legendLink
This still had the 'k8s' in as it was copied and pasted from the
kubernetes-mixin.

Signed-off-by: beorn7 <beorn@grafana.com>
2019-08-20 18:49:12 +02:00
Björn Rabenstein 0f38d680b4
Merge pull request #1449 from prometheus/beorn7/mixin3
node-mixin: Make the severity of "critical" alerts configurable
2019-08-19 13:55:52 +02:00
beorn7 44e5731de7 Add line for number of cores to load graph
Backported from the node dashboard in the kubernetes-mixin.

Signed-off-by: beorn7 <beorn@grafana.com>
2019-08-15 16:43:57 +02:00
beorn7 024d5ed55e Fix title of CPU panel to usage
We use the `mode="idle"` metric, but we are inverting it, so this is
usage, and that's intended.

Signed-off-by: beorn7 <beorn@grafana.com>
2019-08-15 16:36:10 +02:00
beorn7 a016d9cd6f node-mixin: Improve disk usage panel
- Use a stacked graph instead of a gauge as development over time is
  especially useful for disk space usage.

- By only taking one metric per device into account, we avoid
  double-counting for devices that are mounted multiple times.

Signed-off-by: beorn7 <beorn@grafana.com>
2019-08-15 16:32:54 +02:00
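
The double-counting problem mentioned in the commit above can be sketched as follows. The dashboard solves this in PromQL; the data layout, device names, and function name here are hypothetical and only illustrate the idea of keeping one value per device.

```python
def disk_size_by_device(filesystems):
    """Keep one size per device so repeated mounts are not counted twice."""
    sizes = {}
    for fs in filesystems:
        # The first entry seen for a device wins; later mounts are ignored.
        sizes.setdefault(fs["device"], fs["size_bytes"])
    return sizes


# Hypothetical mounts: /dev/sda1 is mounted twice.
mounts = [
    {"device": "/dev/sda1", "mountpoint": "/", "size_bytes": 100},
    {"device": "/dev/sda1", "mountpoint": "/mnt/bind", "size_bytes": 100},
    {"device": "/dev/sdb1", "mountpoint": "/data", "size_bytes": 50},
]
print(sum(disk_size_by_device(mounts).values()))
```
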
Björn Rabenstein 7ef6f2576d
node-mixin: Improve nodes dashboard (#1448)
* node-mixin: Improve nodes dashboard

- Use stacking where it makes sense.
- Normalize idle CPU so that stacking is more meaningful.
- Consistently fill where stacking is used but don't fill where not.
- Fix y axis max value for Idle CPU panel.
- Fix y axis min value for memory usage panel.
- Use `$__interval` for range where applicable (and set min step
  to 1m).
- Make the right Y axis for disk I/O actually work.

These are just incremental improvements. They don't touch the more
involved TODOs.

Signed-off-by: beorn7 <beorn@grafana.com>
2019-08-15 00:40:51 +02:00
beorn7 97ef113762 Make the severity of "critical" alerts configurable
This addresses the blissful scenario where single-node failures are
unproblematic. No reason to wake somebody up if a node is about to
screw itself up by filling the disk.

Signed-off-by: beorn7 <beorn@grafana.com>
2019-08-14 22:24:24 +02:00
beorn7 f350aaf87e node-mixin: Fix various straight-forward issues in the USE dashboards
- Normalize cluster memory utilisation.

- Fix missing `1m` in memory saturation.

- Have both disk-related rows next to each other instead of having the
  network row in between.

- Correctly render transmit network traffic as negative, using
  `seriesOverrides` and `min: null` for the y-axis.

- Make panel and row naming consistent.

- Remove legend where it would just display a single entry with
  exactly the title of the panel.

- Fix metric name in individual node CPU Saturation panel.

- Break up disk space utilisation by device in the panel for an
  individual node.

NB: All of that doesn't touch any more subtle issues captured in the
various TODOs.

Signed-off-by: beorn7 <beorn@grafana.com>
2019-08-13 21:54:28 +02:00
paulfantom c41826274d
docs/node-mixin: move fsSelector and diskDeviceSelector to the end of query
This will cause a query to be valid even if values of selector are
empty.

Additionally fixing query responsible for disk space usage.

Signed-off-by: paulfantom <pawel@krupa.net.pl>
2019-07-24 13:05:02 +02:00
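
The effect of the selector position described in the commit above can be illustrated with a small sketch. The two templates and the selector values are hypothetical and are not copied from the mixin; only the placeholder names `fsSelector` and `nodeExporterSelector` come from the commit history.

```python
# Hypothetical templates: optional selector first vs. optional selector last.
LEADING = "node_filesystem_avail_bytes{%(fsSelector)s, %(nodeExporterSelector)s}"
TRAILING = "node_filesystem_avail_bytes{%(nodeExporterSelector)s, %(fsSelector)s}"


def render(template, config):
    """Substitute named %(key)s placeholders with values from config."""
    return template % config


# Assumed configuration with an empty filesystem selector.
config = {"nodeExporterSelector": 'job="node"', "fsSelector": ""}
print(render(LEADING, config))   # the matcher list starts with a stray comma
print(render(TRAILING, config))  # the empty selector only leaves a trailing comma
```

Per the commit message, placing the optional selector at the end keeps the query valid when the selector is empty.
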
beorn7 79f0357e38 Added _excluding_lo to name of network rules that exclude lo
Signed-off-by: beorn7 <beorn@grafana.com>
2019-07-22 20:21:52 +02:00
beorn7 36dc7451c9 Improvement of comments and panel titles
Signed-off-by: beorn7 <beorn@grafana.com>
2019-07-22 14:06:27 +02:00
beorn7 e01d9f9e78 Break out device in disk IO rules/dashboard
Signed-off-by: beorn7 <beorn@grafana.com>
2019-07-18 15:59:35 +02:00
beorn7 b8c4b0cb29 Removed unneeded sum_ and avg_ from rule names
Signed-off-by: beorn7 <beorn@grafana.com>
2019-07-18 14:14:02 +02:00
beorn7 706511a495 Responses to review comments, round 3
Signed-off-by: beorn7 <beorn@grafana.com>
2019-07-17 23:54:31 +02:00
beorn7 3a770a0b1d Convert annotations from message to summary/description
Signed-off-by: beorn7 <beorn@grafana.com>
2019-07-16 21:40:57 +02:00
beorn7 a92d1d7889 Address review comments, batch 2
Signed-off-by: beorn7 <beorn@grafana.com>
2019-07-16 21:18:17 +02:00
beorn7 3ab1f41d12 Make more use of config.libsonnet
Signed-off-by: beorn7 <beorn@grafana.com>
2019-07-16 19:34:27 +02:00
beorn7 2180c2f3bf Address first batch of old review comments
Signed-off-by: beorn7 <beorn@grafana.com>
2019-07-16 19:14:17 +02:00
beorn7 b3b47f2d07 Make selector naming consistent
Signed-off-by: beorn7 <beorn@grafana.com>
2019-07-10 20:09:01 +02:00
beorn7 dec5b5b053 Fix indentation
Signed-off-by: beorn7 <beorn@grafana.com>
2019-07-10 20:07:20 +02:00
beorn7 9d7045e483 (Re-)adjust to Grafana gauge expecting percentage 0-100 (rather than 1-0)
Signed-off-by: beorn7 <beorn@grafana.com>
2019-07-10 19:40:04 +02:00
beorn7 f331b308f3 Use promgrafonnet as a vendored library from its source
The only deviation that happened so far is to use format="percentunit"
in a Grafana gauge. This change wasn't even properly used in this repo
so far, so I opted to stick with "upstream" for now. If changes are
really needed, we can try to change upstream first.

Another change was done in parallel here and upstream, but it was
"more correct" in upstream. (Change datasource to $datasource
variable, only partially applied here.) Which is another point for
using the upstream and not copying it here.

Signed-off-by: beorn7 <beorn@grafana.com>
2019-07-06 21:11:23 +02:00
beorn7 e5266c242e Add README.md
Signed-off-by: beorn7 <beorn@grafana.com>
2019-07-06 20:30:40 +02:00
beorn7 f2891703a5 Add Makefile to easily make output files and lint sources
Signed-off-by: beorn7 <beorn@grafana.com>
2019-07-06 20:21:56 +02:00
beorn7 f17829c48b Create jsonnet files to create output files
This allows creating YAML files with rules and JSON files with
dashboard descriptions.

Signed-off-by: beorn7 <beorn@grafana.com>
2019-07-06 20:11:27 +02:00
beorn7 cd2981f1b8 Update vendoring to current location of jsonnet-libs
Signed-off-by: beorn7 <beorn@grafana.com>
2019-07-06 20:10:47 +02:00
beorn7 2df034c055 Move node-mixin into docs directory
Signed-off-by: beorn7 <beorn@grafana.com>
2019-07-05 19:38:03 +02:00
Cougar 764da30556 Add compat rules for node_time, node_memory_ShmemHugePages and node_memory_ShmemPmdMapped (#1138)
Signed-off-by: Cougar <cougar@random.ee>
2018-11-05 16:40:19 +01:00
Ben Kochie 5d23ad0ca7
Fix supervisord collector (#978)
* Replace supervisord xmlrpc library
* Use `github.com/mattn/go-xmlrpc` that doesn't leak goroutines.
* Fix uptime metric

* Use Prometheus best practices for uptime metric.
  * Use "start time" rather than "uptime".
  * Don't emit a start time if the process is down.
* Add changelog entry.
* Add example compatibility rules.

Signed-off-by: Ben Kochie <superq@gmail.com>
2018-08-06 16:54:46 +02:00
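
The "start time" practice named in the commit above can be sketched as follows: export a fixed start timestamp and let the consumer derive uptime when needed. The function names and values are assumptions for illustration and do not reflect the collector's actual code.

```python
def start_time_metric(is_running, start_time_seconds):
    """Return the start-time sample, or None when the process is down."""
    return start_time_seconds if is_running else None


def uptime_seconds(now, start_time_seconds):
    """Derive uptime on the consumer side from the exported start time."""
    return now - start_time_seconds


# Hypothetical timestamps in Unix seconds.
start = start_time_metric(True, 1_533_500_000)
print(uptime_seconds(1_533_567_286, start))
```

A start timestamp stays constant while the process runs, so it only changes on restart, whereas an uptime value changes on every scrape.
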
Rene Treffer 80a5712b97 Fix sample rules for migration (#1022)
- add conversion from _ms to _seconds on disk metrics
- add missing node_textfile_mtime section
- add groups: header to pass promtool check rules

Signed-off-by: Rene Treffer <rene.treffer@soundcloud.com>
2018-07-27 14:27:44 +02:00
Ivan Kiselev ae90bac5b8 Add example of translating new metrics to old format in case of migration to 1.16 version (#982)
Add additional example of how to save old metrics

Signed-off-by: Ivan Kiselev <ivan@messagebird.com>
2018-07-02 12:39:32 +02:00
Roman Vynar 55c32fcf02 Add compat rules for filesystem collector. (#973)
Signed-off-by: Roman Vynar <roman.vynar@goquiq.com>
2018-06-13 18:32:07 +02:00
Nicholas Capo 09d11817d0 docs: Add example recording rule for node_memory_MemAvailable
Signed-off-by: Nicholas Capo <nicholas.capo@gmail.com>
2018-05-16 17:01:51 -05:00
Ben Kochie c5a74ce1a1
Add label mangling.
Signed-off-by: Ben Kochie <superq@gmail.com>
2018-05-14 12:24:05 +02:00
Ben Kochie dc1972e9e3
Document upgrade options for v0.16.0
* Add an upgrade guide.
* Add example recording rules.

Signed-off-by: Ben Kochie <superq@gmail.com>
2018-05-11 13:45:36 +02:00
Brian Brazil 52c031890e
Add _seconds suffix to node_time. (#823)
2018-02-14 16:59:08 +00:00
Leonid Evdokimov c169b4b1c5 Add metrics from SNTPv4 packet to ntp collector & add ntpd sanity check (#655)
* Add metrics from SNTPv4 packet to ntp collector & add ntpd sanity check

1. Checking the local clock against a remote NTP daemon is a bad idea; a
local ntpd acting as a client should do it better and avoid excessive load
on the remote NTP server, so the collector is refactored to query the local
NTP server.

2. Checking local clock against remote one does not check local ntpd
itself. Local ntpd may be down or out of sync due to network issues, but
clock will be OK.

3. Checking NTP server using sanity of its response is tricky and
depends on ntpd implementation, that's why common `node_ntp_sanity`
variable is exported.

* `govendor add golang.org/x/net/ipv4`, it is dependency of github.com/beevik/ntp

* Update github.com/beevik/ntp to include boring SNTP fix

* Use variable name from RFC5905

* ntp: move code to make export of raw metrics more explicit

* Move NTP math to `github.com/beevik/ntp`

* Make `golint` happy

* Add some brief docs explaining `ntp` #655 and `timex` #664 modules

* ntp: drop XXX comment that got its decision

* ntp: add `_seconds` suffix to relevant metrics

* Better `node_ntp_leap` comment

* s/node_ntp_reftime/node_ntp_reference_timestamp_seconds/ as requested by @discordianfish

* Extract subsystem name to const as suggested by @SuperQ
2017-09-19 10:36:14 +02:00