Add metrics that expose more information about MD RAID devices and
disks:
- the RAID level in use
- the RAID set that a disk belongs to
This allows for things like alerting on unusually high I/O
utilisation for a disk compared to other disks in the same RAID set,
which usually means the disk is failing, and comparing
write/read latency across RAID sets.
Output looks like:
node_md_disk_info{disk_device="/dev/dm-0", md_device="md1", md_set="A"} 1
node_md_disk_info{disk_device="/dev/dm-3", md_device="md1", md_set="B"} 1
node_md_disk_info{disk_device="/dev/dm-2", md_device="md1", md_set="A"} 1
node_md_disk_info{disk_device="/dev/dm-1", md_device="md1", md_set="B"} 1
node_md_disk_info{disk_device="/dev/dm-4", md_device="md1", md_set="A"} 1
node_md_disk_info{disk_device="/dev/dm-5", md_device="md1", md_set="B"} 1
node_md_info{md_device="md1", md_name="foo", raid_level="10", md_metadata_version="1.2"} 1
The `node_md_info` metric, which gives additional information about the
RAID array, is intentionally separate to avoid adding all of those
labels to each disk. If you need to query using the labels contained in
`node_md_info`, you can do that using PromQL:
https://www.robustperception.io/how-to-have-labels-for-machine-roles/
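As a rough illustration of what such a join does, the sketch below mimics in Python a `group_left` style match on `md_device`, copying `raid_level` from the info series onto each disk series. The function name `join_info` and the dictionary layout are assumptions made for illustration; they are not part of the script or of PromQL.

```python
def join_info(disk_series, info_series, on="md_device", copy=("raid_level",)):
    """Attach selected labels from an info series to each matching disk series."""
    # Index the info series by the join label (one info series per array).
    info_by_key = {labels[on]: labels for labels in info_series}
    joined = []
    for labels in disk_series:
        info = info_by_key.get(labels[on])
        if info is None:
            continue  # no matching info series, so the sample is dropped
        merged = dict(labels)
        for name in copy:
            merged[name] = info[name]
        joined.append(merged)
    return joined


disks = [
    {"disk_device": "/dev/dm-0", "md_device": "md1", "md_set": "A"},
    {"disk_device": "/dev/dm-3", "md_device": "md1", "md_set": "B"},
]
infos = [{"md_device": "md1", "md_name": "foo", "raid_level": "10"}]
result = join_info(disks, infos)
```

Only the labels named in `copy` are carried over, so the per-disk series stay small unless a query asks for more.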
I looked at adding the array UUID, but there's no sysfs entry for it and
I'm not sure there's a strong use case for it.
This patch to add a sysfs entry for the UUID was apparently not
accepted:
https://www.spinics.net/lists/raid/msg40667.html
Add these metrics as a textfile script rather than adding them to the Go
'md' module as they're perhaps less commonly useful. If lots of people
find them useful, we can later rewrite this in Go.
Signed-off-by: Matt Bostock <mbostock@cloudflare.com>
* If NRestarts or NRefused are not available, don't ignore the unit itself
* Don't report systemd metrics (NRestarts/NRefused) that are not available
Signed-off-by: James Hartig <james@getadmiral.com>
PIDs can vanish (exit) from /proc/ between gathering the list of PIDs
and getting all of their stats.
* Ignore file not found errors.
* Explicitly count the PIDs we find.
* Clean up some error style issues.
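The idea can be sketched in Python as follows. This is a minimal illustration, not the collector's code: the function name `read_pid_stats` and the choice of reading the `stat` file are assumptions.

```python
import os


def read_pid_stats(proc_root="/proc"):
    """Return (pid_count, stats) while tolerating PIDs that exit mid-scan."""
    stats = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                stats[int(entry)] = handle.read()
        except (FileNotFoundError, ProcessLookupError):
            # The process exited between listing and reading; skip it.
            continue
    # Count only the PIDs whose stats were actually read.
    return len(stats), stats
```

Counting the entries that were read, rather than the entries that were listed, keeps the reported total consistent with the stats gathered.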
Signed-off-by: Ben Kochie <superq@gmail.com>
* Replace supervisord xmlrpc library
  * Use `github.com/mattn/go-xmlrpc` that doesn't leak goroutines.
* Fix uptime metric
  * Use Prometheus best practices for uptime metric.
  * Use "start time" rather than "uptime".
  * Don't emit a start time if the process is down.
* Add changelog entry.
* Add example compatibility rules.
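A minimal Python sketch of the "start time" approach, under stated assumptions: the metric name `example_process_start_time_seconds` and the function `start_time_sample` are hypothetical and are not the names used by the collector.

```python
def start_time_sample(name, is_running, start_timestamp):
    """Return an exposition line for a process start time, or None if down."""
    if not is_running:
        return None  # a stopped process has no meaningful start time
    # Hypothetical metric name, used for illustration only.
    return 'example_process_start_time_seconds{name="%s"} %d' % (
        name,
        start_timestamp,
    )
```

With a start timestamp exported, uptime can be derived at query time by subtracting it from the current time, so the exported value only changes when the process restarts.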
Signed-off-by: Ben Kochie <superq@gmail.com>
* vendor: Update prometheus/procfs
Signed-off-by: Hannes Körber <hannes.koerber@haktec.de>
* mountstats: Use new NFS protocol field
In https://github.com/prometheus/procfs/pull/100, the NFSTransportStats
struct was expanded by a field called protocol that specifies the NFS
protocol in use, either "tcp" or "udp". This commit adds the protocol as
a label to all NFS metrics exported via the mountstats collector.
Signed-off-by: Hannes Körber <hannes.koerber@haktec.de>
* Update fixtures for UDP mount
Signed-off-by: Hannes Körber <hannes.koerber@haktec.de>
It is quite common to put /var/lib/docker itself on a separate partition
and that should be monitored as well.
Signed-off-by: Johannes Wienke <languitar@semipol.de>
While the statfs(2) approach is reliable for normally mounted filesystems, the
flags returned can be inconsistent when a filesystem has been remounted read-only
after encountering an error. The returned flags do accurately represent the
internal state of the filesystem, but they do not reflect whether the VFS layer
will accept writes. Instead, it makes sense to parse the current VFS mount
state from the options field in /proc/mounts since it takes precedence.
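As an illustration of reading the mount state from the options field, here is a small Python sketch. The function name `read_only_mounts` is an assumption, and this is not the collector's implementation.

```python
def read_only_mounts(mounts_text):
    """Map each mount point to True if its /proc/mounts options include 'ro'."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # malformed or empty line
        mount_point, options = fields[1], fields[3].split(",")
        # Match the whole option so "errors=remount-ro" is not mistaken for "ro".
        result[mount_point] = "ro" in options
    return result


sample = (
    "/dev/sda1 / ext4 rw,relatime,errors=remount-ro 0 0\n"
    "/dev/sdb1 /data ext4 ro,relatime 0 0\n"
)
state = read_only_mounts(sample)
```

Splitting the options on commas and comparing whole tokens avoids false positives from options that merely contain the substring `ro`.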
Signed-off-by: Brandon Gilmore <bgilmore@valvesoftware.com>
* add sys/class/net parsing from procfs and expose its metrics
Signed-off-by: Jan Klat <jenik@klatys.cz>
* change code to use int pointers per procfs change, move netclass to separate collector, change metric naming
Signed-off-by: Jan Klat <jenik@klatys.cz>
* bump year in licence, remove redundant newline, correct fixtures
Signed-off-by: Jan Klat <jenik@klatys.cz>
* fix style
Signed-off-by: Jan Klat <jenik@klatys.cz>
* change carrier changes to counter type
Signed-off-by: Jan Klat <jenik@klatys.cz>
* fix e2e output
Signed-off-by: Jan Klat <jenik@klatys.cz>
* add fixtures
Signed-off-by: Jan Klat <jenik@klatys.cz>
* update vendor, use fixtures correctly
Signed-off-by: Jan Klat <jenik@klatys.cz>
* change fixtures (device in /sys/class/net should be symlinked)
Signed-off-by: Jan Klat <jenik@klatys.cz>
* correct fixtures for 64k page, updated readme
Signed-off-by: Jan Klat <jenik@klatys.cz>
Fixed spelling mistakes.
Update transport_generic.go
Changed to a mutex approach instead of channels and added a timeout before declaring a mount stuck.
Removed unnecessary lock channel and clarified some var names.
Fixed style nits.
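The mutex-plus-timeout idea can be sketched in Python as below. This is only an illustration of the general technique; the class name `StuckMountTracker`, its methods, and the timeout handling are assumptions and do not mirror the Go code in this change.

```python
import threading


class StuckMountTracker:
    """Track mount points whose stat call has not returned within a timeout."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self._lock = threading.Lock()  # guards the set of stuck mount points
        self._stuck = set()

    def check(self, mount_point, stat_func):
        """Run stat_func(mount_point); return False if the mount looks stuck."""
        with self._lock:
            if mount_point in self._stuck:
                return False  # an earlier call is still hanging; don't pile up
        done = threading.Event()

        def worker():
            try:
                stat_func(mount_point)
            finally:
                done.set()
                with self._lock:
                    self._stuck.discard(mount_point)  # the mount recovered

        threading.Thread(target=worker, daemon=True).start()
        if done.wait(self.timeout_seconds):
            return True
        with self._lock:
            if not done.is_set():
                self._stuck.add(mount_point)
        return done.is_set()
```

A mount is only declared stuck after the timeout expires, and it is cleared again as soon as the hanging call eventually returns.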
Signed-off-by: Mark Knapp <mknapp@hudson-trading.com>
* Add support for NRestarts counter introduced in systemd 235
`.service` units increment this counter any time the Restart= condition is
triggered.
Signed-off-by: Matthew McGinn <mamcgi@gmail.com>
* Send "Personality unknown" to debug, not info, remove unnecessary newline.
* Add support for "linear" personality.
* Always set number of active disks to 0 when a device is inactive.
* Add total disks calculation to unknown personalities.
Signed-off-by: Ben Kochie <superq@gmail.com>
* Fix for #945, cpu temperature is signed.
Added a type conversion to cpu temperature sysctl. Will still
collect/report -1 when the value is -1, because it should be up
to interpretation whether this is the correct value for the system or
not.
Some drivers will report -1 for cpu temperature. Other sensors will
report "an input into the fan control algorithm", i.e. not the actual
temperature, but how much fan it wants. Some people cool their machines
with liquid nitrogen.
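To show the kind of conversion involved, the Python sketch below reinterprets an unsigned value as signed. The 32-bit width and the function name `as_signed_32` are assumptions for illustration; the commit does not state the width of the sysctl value.

```python
def as_signed_32(raw):
    """Reinterpret an unsigned 32-bit integer as a signed two's-complement value."""
    raw &= 0xFFFFFFFF  # keep only the low 32 bits (assumed width)
    return raw - 0x100000000 if raw & 0x80000000 else raw
```

Without the conversion, a reading of -1 would show up as 4294967295 rather than as -1.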
Signed-off-by: Derek Marcotte <554b8425@razorfever.net>
Add metrics that count how many running processes are linking to deleted
libraries on each machine. Deleted libraries are usually outdated
libraries, and outdated libraries may have known security
vulnerabilities.
The rationale behind storing these as metrics is to allow the rollout of
security fixes to be tracked across a fleet of machines, ensuring that
all affected processes are restarted (e.g. via a reboot).
I'm parsing the output from `/proc/*/maps` because using `lsof -d DEL`
can be too slow, particularly if you have sockets that bind to
thousands of IP addresses.
The metric labels include the library path and the base filename, which
allows us to pinpoint the exact path of the deleted library but also
allows us to aggregate on the library name (or approximations of it)
even if library locations differ between operating system versions.
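The parsing approach can be sketched roughly as follows. This is not the script from this commit: the function name `count_deleted_libraries`, the input layout, and the "directory contains `/lib`" heuristic for deciding what counts as a library are all assumptions, and paths are assumed to contain no spaces.

```python
import collections
import os


def count_deleted_libraries(maps_by_pid):
    """Count processes per deleted library from {pid: text of /proc/<pid>/maps}."""
    counts = collections.Counter()
    for maps_text in maps_by_pid.values():
        seen = set()  # count each library once per process
        for line in maps_text.splitlines():
            fields = line.split()
            # Expected fields: address perms offset dev inode path "(deleted)"
            if len(fields) != 7 or fields[6] != "(deleted)":
                continue
            path = fields[5]
            if "/lib" not in os.path.dirname(path):
                continue  # assumed heuristic for "is a library"
            seen.add(path)
        counts.update(seen)
    return counts


lib = "/usr/lib/x86_64-linux-gnu/libevent-2.0.so.5.1.9"
sample = {
    101: "7f00-7f10 r-xp 00000000 08:01 123 %s (deleted)\n"
         "7f10-7f20 rw-p 00001000 08:01 123 %s (deleted)\n" % (lib, lib),
    102: "7f00-7f10 r-xp 00000000 08:01 123 %s (deleted)\n"
         "7f20-7f30 r--p 00000000 08:01 456 /usr/lib/locale/locale-archive (deleted)\n"
         "7f30-7f40 rw-s 00000000 00:05 789 /tmp/scratch (deleted)\n" % lib,
}
counts = count_deleted_libraries(sample)
```

The full path and `os.path.basename(path)` can then be emitted as separate labels, which is what makes aggregation by library name possible.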
The metrics output and the CPU time consumed are as follows:
user@host:~$ time sudo python processes.py
# HELP node_processes_linking_deleted_libraries Count of running processes that link a deleted library
# TYPE node_processes_linking_deleted_libraries gauge
node_processes_linking_deleted_libraries{library_path="locale-archive", library_name="/usr/lib/locale"} 3
node_processes_linking_deleted_libraries{library_path="libevent-2.0.so.5.1.9", library_name="/usr/lib/x86_64-linux-gnu"} 4
real 0m0.071s
user 0m0.030s
sys 0m0.041s
Including the library filename and path will result in reasonably high
metrics cardinality; however, I think the benefits when an urgent
security patch is being deployed outweigh concerns around cardinality.
This script assumes that library files do not contain spaces in their
path.
Signed-off-by: Matt Bostock <mbostock@cloudflare.com>
The directory size text collector example uses the wrong metric name in the HELP and TYPE lines, rendering the comments unusable.
This fixes that by using the same metric name.
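One way to avoid this class of mistake is to generate all three kinds of line from a single name. The Python sketch below is illustrative only: `render_gauge` and the metric name `example_directory_size_bytes` are hypothetical and are not taken from the example script.

```python
def render_gauge(name, help_text, samples):
    """Render a gauge, using one metric name for HELP, TYPE and sample lines."""
    lines = ["# HELP %s %s" % (name, help_text), "# TYPE %s gauge" % name]
    for labels, value in samples:
        label_text = ",".join('%s="%s"' % item for item in sorted(labels.items()))
        lines.append("%s{%s} %s" % (name, label_text, value))
    return "\n".join(lines) + "\n"


output = render_gauge(
    "example_directory_size_bytes",  # hypothetical metric name
    "Disk space used",
    [({"directory": "/var/lib"}, 1024)],
)
```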
Signed-off-by: Sandor Zeestraten <sandor@zeestrataca.com>