node_exporter

mirror of https://github.com/prometheus/node_exporter.git synced 2025-08-20 18:33:52 -07:00

Author	SHA1	Message	Date
paulfantom	820f8d595e	docs/node-mixin: alert on desynchronised clock Signed-off-by: paulfantom <pawel@krupa.net.pl>	2020-03-23 08:23:58 +01:00
Neraud	1006a2c4bb	Add missing coma Signed-off-by: Neraud <neraud.login@gmail.com>	2020-03-21 13:06:43 +01:00
Povilas Versockas	48bb6f670c	Add NodeHighNumberConntrackEntriesUsed Signed-off-by: Povilas Versockas <p.versockas@gmail.com>	2020-03-20 17:46:05 +01:00
iuri aranda	0107bc7942	Make FS space alerts thresholds configurable (#1624) * Make FS space alerts thresholds configurable (#1) This makes it possible to tweak the thresholds for the NodeFilesystemSpaceFillingUp alerts. Which might be necessary in systems like Kubernetes, where the image garbage collector runs at 85%, so it's not a problem that the disk reaches that usage %. Signed-off-by: iuri aranda <iuri@skyscrapers.eu>	2020-03-02 16:24:51 +01:00
paulfantom	40570924b1	docs/node-mixin/dashboards: do not mix tabs and spaces Signed-off-by: paulfantom <pawel@krupa.net.pl>	2019-11-01 15:46:21 +01:00
beorn7	c6914477f5	Fix the normalization for the cluster-wide dashboards We actually have to count or sum, respectively, _all_ the selected metrics for the cluster-wide view. Which means it's easiest to use the `scalar` approach after all (but only in the cluster dashboard). This still propagates all the labels. I have extended the comment for the `nodeExporterSelector` to note that the cluster dashboard only makes sense if all the selected node exporter actually belong to the same cluster. Since this is jsonnet, users can easily disable the cluster dashboard. Or even create multiple instances of the dashboards with different `nodeExporterSelector`s for different clusters. Signed-off-by: beorn7 <beorn@grafana.com>	2019-10-30 22:52:36 +01:00
Benoît Knecht	5a7b85876d	docs/node-mixin: Improve memory pressure rule The `instance:node_memory_swap_io_pages:rate1m` rule was intended to measure the amount of memory pressure a system is under, but its name is a bit misleading (it specifically refers to swap), and the rate of `node_vmstat_pgmajfault` is a better metric for memory pressure (see #1524). This commit renames `instance:node_memory_swap_io_pages:rate1m` to `instance:node_vmstat_pgmajfault:rate1m`, and defines it as `rate(node_vmstat_pgmajfault{%(nodeExporterSelector)s}[1m])`. The dashboards are updated accordingly. Signed-off-by: Benoît Knecht <benoit.knecht@fsfe.org>	2019-10-28 15:12:42 +01:00
Scott Brenner	813a4bdf8b	Two quick typo fixes Signed-off-by: Scott Brenner <scott@scottbrenner.me>	2019-10-09 20:42:27 -07:00
Björn Rabenstein	855a1f1d18	Merge pull request #1482 from leojonathanoh/fix-node-mixin-prometheus-alert-rules-to-use-percentage Fix node-mixin prometheus alert rules to use percentage	2019-09-26 20:01:18 +02:00
Sergiusz Urbaniak	f4417b209a	node-mixin: fix configuration for unset fsSelector/diskDeviceSelector As per https://github.com/prometheus/node_exporter/pull/1429#discussion_r304210103 we want to fetch all devices and all fs types. Currently, this is done by setting empty string which breaks most queries which rely on it. This fixes it by setting the appropriate selector instead of empty string. Signed-off-by: Sergiusz Urbaniak <sergiusz.urbaniak@gmail.com>	2019-09-12 14:02:56 +02:00
Sergiusz Urbaniak	ed78237036	node-mixin: fix query in Disk Space Utilisation dashboard Signed-off-by: Sergiusz Urbaniak <sergiusz.urbaniak@gmail.com>	2019-09-12 14:02:56 +02:00
Leo	dfeec07f2f	Fix node-mixin prometheus alert rules to use percentage Signed-off-by: Leo <leonardjonathanoh@live.com>	2019-09-11 08:47:24 +00:00
Björn Rabenstein	ab8cf1f718	Node mixin: Clarify dashboard dependency on rules (#1475) Following @discordianfish's suggestion [here](https://github.com/prometheus/node_exporter/issues/1454#issuecomment-524225222). Signed-off-by: beorn7 <beorn@grafana.com>	2019-09-08 10:55:43 +02:00
beorn7	76ff263ca6	Update legendLink This still had the 'k8s' in as it was copied and pasted from the kubernetes-mixin. Signed-off-by: beorn7 <beorn@grafana.com>	2019-08-20 18:49:12 +02:00
Björn Rabenstein	0f38d680b4	Merge pull request #1449 from prometheus/beorn7/mixin3 node-mixin: Make the severity of "critical" alerts configurable	2019-08-19 13:55:52 +02:00
beorn7	44e5731de7	Add line for number of cores to load graph Backported from the node dashboard in the kubernetes-mixin. Signed-off-by: beorn7 <beorn@grafana.com>	2019-08-15 16:43:57 +02:00
beorn7	024d5ed55e	Fix title of CPU panel to usage We use the `mode="idle"` metric, but we are inverting it, so this is usage, and that's intended. Signed-off-by: beorn7 <beorn@grafana.com>	2019-08-15 16:36:10 +02:00
beorn7	a016d9cd6f	node-mixin: Improve disk usage panel - Use a stacked graph instead of a gauge as development over time is especially useful for disk space usage. - By only taking one metric per device into account, we avoid double-counting for devices that are mounted multiple times. Signed-off-by: beorn7 <beorn@grafana.com>	2019-08-15 16:32:54 +02:00
Björn Rabenstein	7ef6f2576d	node-mxin: Improve nodes dashboard (#1448) * node-mixin: Improve nodes dashboard - Use stacking where it makes sense. - Normalize idle CPU so that stacking is more meaningful. - Consistently fill where stacking is used but don't fill where not. - Fix y axis max value for Idle CPU panel. - Fix y axis min value for memory usage panel. - Use `$__interval` for range where applicable (and set min step to 1m). - Make the right Y axis for disk I/O actually work. This is just an incremental improvements. It doesn't touch the more involved TODOs. Signed-off-by: beorn7 <beorn@grafana.com>	2019-08-15 00:40:51 +02:00
beorn7	97ef113762	Make the severity of "critical" alerts configurable This addresses the blissful scenario where single-node failures are unproblematic. No reason to wake somebody up if a node is about to screw itself up by filling the disk. Signed-off-by: beorn7 <beorn@grafana.com>	2019-08-14 22:24:24 +02:00
beorn7	f350aaf87e	node-mixin: Fix various straight-forward issues in the USE dashboards - Normalize cluster memory utilisation. - Fix missing `1m` in memory saturation. - Have both disk-related row next to each other instead with the network row in between. - Correctly render transmit network traffic as negative, using `seriesOverrides` and `min: null` for the y-axis. - Make panel and row naming consistent. - Remove legend where it would just display a single entry with exactly the title of the panel. - Fix metric name in individual node CPU Saturation panel. - Break up disk space utilisation by device in the panel for an individual node. NB: All of that doesn't touch any more subtle issues captured in the various TODOs. Signed-off-by: beorn7 <beorn@grafana.com>	2019-08-13 21:54:28 +02:00
paulfantom	c41826274d	docs/node-mixin: move fsSelector and diskDeviceSelector to the end of query This will cause a query to be valid even if values of selector are empty. Additionally fixing query responsible for disk space usage. Signed-off-by: paulfantom <pawel@krupa.net.pl>	2019-07-24 13:05:02 +02:00
beorn7	79f0357e38	Added `_excluding_lo` to name of network rules that exclude lo Signed-off-by: beorn7 <beorn@grafana.com>	2019-07-22 20:21:52 +02:00
beorn7	36dc7451c9	Improvement of comments and panel titles Signed-off-by: beorn7 <beorn@grafana.com>	2019-07-22 14:06:27 +02:00
beorn7	e01d9f9e78	Break out device in disk IO rules/dashboard Signed-off-by: beorn7 <beorn@grafana.com>	2019-07-18 15:59:35 +02:00
beorn7	b8c4b0cb29	Removed unneeded `sum_` and `avg_` from rule names Signed-off-by: beorn7 <beorn@grafana.com>	2019-07-18 14:14:02 +02:00
beorn7	706511a495	Responses to review comments, round 3 Signed-off-by: beorn7 <beorn@grafana.com>	2019-07-17 23:54:31 +02:00
beorn7	3a770a0b1d	Convert annotations from message to summary/description Signed-off-by: beorn7 <beorn@grafana.com>	2019-07-16 21:40:57 +02:00
beorn7	a92d1d7889	Address review comments, batch 2 Signed-off-by: beorn7 <beorn@grafana.com>	2019-07-16 21:18:17 +02:00
beorn7	3ab1f41d12	Make more use of config.libsonnet Signed-off-by: beorn7 <beorn@grafana.com>	2019-07-16 19:34:27 +02:00
beorn7	2180c2f3bf	Address first batch of old review comments Signed-off-by: beorn7 <beorn@grafana.com>	2019-07-16 19:14:17 +02:00
beorn7	b3b47f2d07	Make selector naming consistent Signed-off-by: beorn7 <beorn@grafana.com>	2019-07-10 20:09:01 +02:00
beorn7	dec5b5b053	Fix indentation Signed-off-by: beorn7 <beorn@grafana.com>	2019-07-10 20:07:20 +02:00
beorn7	9d7045e483	(Re-)adjust to Grafana gauge expecting percentage 0-100 (rather than 1-0) Signed-off-by: beorn7 <beorn@grafana.com>	2019-07-10 19:40:04 +02:00
beorn7	f331b308f3	Use promgrafonnet as a vendored library from its source The only deviation that happened so far is to use format="percentunit" in a Grafana gauge. This change wasn't even properly used in this repo so far, so I opted to stick with "upstream" for now. If changes are really needed, we can try to change upstream first. Another change was done in parallal here and upstream, but it was "more correct" in upstream. (Change datasource to $datasource variable, only partially applied here.) Which is another point for using the upstream and not copy it here. Signed-off-by: beorn7 <beorn@grafana.com>	2019-07-06 21:11:23 +02:00
beorn7	e5266c242e	Add README.md Signed-off-by: beorn7 <beorn@grafana.com>	2019-07-06 20:30:40 +02:00
beorn7	f2891703a5	Add Makefile to easily make output files and lint sources Signed-off-by: beorn7 <beorn@grafana.com>	2019-07-06 20:21:56 +02:00
beorn7	f17829c48b	Create jsonnet files to create output files This allows to create YAML files with rules and JSON files with dashboard descriptions. Signed-off-by: beorn7 <beorn@grafana.com>	2019-07-06 20:11:27 +02:00
beorn7	cd2981f1b8	Update vendoring to current location of jsonnet-libs Signed-off-by: beorn7 <beorn@grafana.com>	2019-07-06 20:10:47 +02:00
beorn7	2df034c055	Move node-mixin into docs directory Signed-off-by: beorn7 <beorn@grafana.com>	2019-07-05 19:38:03 +02:00
Cougar	764da30556	Add compat rules for node_time, node_memory_ShmemHugePages and node_memory_ShmemPmdMapped (#1138) Signed-off-by: Cougar <cougar@random.ee>	2018-11-05 16:40:19 +01:00
Ben Kochie	5d23ad0ca7	Fix supervisord collector (#978) * Replace supervisord xmlrpc library * Use `github.com/mattn/go-xmlrpc` that doesn't leak goroutines. * Fix uptime metric * Use Prometheus best practices for uptime metric. * Use "start time" rather than "uptime". * Don't emit a start time if the process is down. * Add changelog entry. * Add example compatibility rules. Signed-off-by: Ben Kochie <superq@gmail.com>	2018-08-06 16:54:46 +02:00
Rene Treffer	80a5712b97	Fix sample rules for migration (#1022) - add conversion from _ms to _seconds on disk metrics - add missing node_textfile_mtime section - add groups: header to pass promtool check rules Signed-off-by: Rene Treffer <rene.treffer@soundcloud.com>	2018-07-27 14:27:44 +02:00
Ivan Kiselev	ae90bac5b8	Add example of translating new metrics to old format in case of migration to 1.16 version (#982) Add additional example of how to save old metrics Signed-off-by: Ivan Kiselev <ivan@messagebird.com>	2018-07-02 12:39:32 +02:00
Roman Vynar	55c32fcf02	Add compat rules for filesystem collector. (#973) Signed-off-by: Roman Vynar <roman.vynar@goquiq.com>	2018-06-13 18:32:07 +02:00
Nicholas Capo	09d11817d0	docs: Add example recording rule for node_memory_MemAvailable Signed-off-by: Nicholas Capo <nicholas.capo@gmail.com>	2018-05-16 17:01:51 -05:00
Ben Kochie	c5a74ce1a1	Add label mangling. Signed-off-by: Ben Kochie <superq@gmail.com>	2018-05-14 12:24:05 +02:00
Ben Kochie	dc1972e9e3	Document upgrade options for v0.16.0 * Add an upgrade guide. * Add an example recording rules. Signed-off-by: Ben Kochie <superq@gmail.com>	2018-05-11 13:45:36 +02:00
Brian Brazil	52c031890e	Add _seconds suffix to node_time. (#823)	2018-02-14 16:59:08 +00:00
Leonid Evdokimov	c169b4b1c5	Add metrics from SNTPv4 packet to ntp collector & add ntpd sanity check (#655) * Add metrics from SNTPv4 packet to ntp collector & add ntpd sanity check 1. Checking local clock against remote NTP daemon is bad idea, local ntpd acting as a client should do it better and avoid excessive load on remote NTP server so the collector is refactored to query local NTP server. 2. Checking local clock against remote one does not check local ntpd itself. Local ntpd may be down or out of sync due to network issues, but clock will be OK. 3. Checking NTP server using sanity of it's response is tricky and depends on ntpd implementation, that's why common `node_ntp_sanity` variable is exported. * `govendor add golang.org/x/net/ipv4`, it is dependency of github.com/beevik/ntp * Update github.com/beevik/ntp to include boring SNTP fix * Use variable name from RFC5905 * ntp: move code to make export of raw metrics more explicit * Move NTP math to `github.com/beevik/ntp` * Make `golint` happy * Add some brief docs explaining `ntp` #655 and `timex` #664 modules * ntp: drop XXX comment that got its decision * ntp: add `_seconds` suffix to relevant metrics * Better `node_ntp_leap` comment * s/node_ntp_reftime/node_ntp_reference_timestamp_seconds/ as requested by @discordianfish * Extract subsystem name to const as suggested by @SuperQ	2017-09-19 10:36:14 +02:00

50 commits