Commit graph

26 commits

Author SHA1 Message Date
Matt Bostock 9e0aee8ae7 Add metrics exposing extended md RAID info (#958)
Add metrics that expose more information about MD RAID devices and
disks:

- the RAID level in use
- the RAID set that a disk belongs to

This allows for things like alerting on unusually high I/O
utilisation for a disk compared to other disks in the same RAID set,
which usually means the disk is failing, and for comparing
write/read latency across RAID sets.

Output looks like:

    node_md_disk_info{disk_device="/dev/dm-0", md_device="md1", md_set="A"} 1
    node_md_disk_info{disk_device="/dev/dm-3", md_device="md1", md_set="B"} 1
    node_md_disk_info{disk_device="/dev/dm-2", md_device="md1", md_set="A"} 1
    node_md_disk_info{disk_device="/dev/dm-1", md_device="md1", md_set="B"} 1
    node_md_disk_info{disk_device="/dev/dm-4", md_device="md1", md_set="A"} 1
    node_md_disk_info{disk_device="/dev/dm-5", md_device="md1", md_set="B"} 1
    node_md_info{md_device="md1", md_name="foo", raid_level="10", md_metadata_version="1.2"} 1

The `node_md_info` metric, which gives additional information about the
RAID array, is intentionally separate to avoid adding all of those
labels to each disk. If you need to query using the labels contained in
`node_md_info`, you can do that using PromQL:
https://www.robustperception.io/how-to-have-labels-for-machine-roles/

I looked at adding the array UUID, but there's no sysfs entry for it and
I'm not sure there's a strong use case for it.

This patch to add a sysfs entry for the UUID was apparently not
accepted:
https://www.spinics.net/lists/raid/msg40667.html

Add these metrics as a textfile script rather than adding them to the Go
'md' module as they're perhaps less commonly useful. If lots of people
find them useful, we can later rewrite this in Go.

Signed-off-by: Matt Bostock <mbostock@cloudflare.com>
2018-08-18 08:57:51 +00:00
Bernd Müller ee1e1997bc Add scsi smart data to prometheus exporter (#862)
Add scsi smart data to prometheus exporter

Signed-off-by: mueller <mueller@b1-systems.de>
2018-07-04 00:30:20 +02:00
Matt Bostock f56e8fcdf4 Fix spelling of celsius in IPMI example script (#967)
'Celsius' should be spelt with an 's':
https://en.wikipedia.org/wiki/Celsius

Signed-off-by: Matt Bostock <mbostock@cloudflare.com>
2018-06-08 19:21:19 +02:00
Matt Bostock 516e5d4beb Add metric for outdated libraries (#957)
Add metrics that count how many running processes are linking to deleted
libraries on each machine. Deleted libraries are usually outdated
libraries, and outdated libraries may have known security
vulnerabilities.

The rationale behind storing these as metrics is to allow the rollout of
security fixes to be tracked across a fleet of machines, ensuring that
all affected processes are restarted (e.g. via a reboot).

I'm parsing the output from `/proc/*/maps` because using `lsof -d DEL`
can be too slow, particularly if you have sockets that bind to
thousands of IP addresses.

The metric labels include the library path and the base filename, which
allows us to pinpoint the exact path of the deleted library but also
allows us to aggregate on the library name (or approximations of it)
even if library locations differ between operating system versions.

The metrics output and the CPU time consumed is as follows:

    user@host:~$ time sudo python processes.py
    # HELP node_processes_linking_deleted_libraries Count of running processes that link a deleted library
    # TYPE node_processes_linking_deleted_libraries gauge
    node_processes_linking_deleted_libraries{library_path="locale-archive", library_name="/usr/lib/locale"} 3
    node_processes_linking_deleted_libraries{library_path="libevent-2.0.so.5.1.9", library_name="/usr/lib/x86_64-linux-gnu"} 4

    real        0m0.071s
    user        0m0.030s
    sys         0m0.041s

Including the library filename and path will result in reasonably high
metrics cardinality; however, I think the benefits when an urgent
security patch is being deployed outweigh concerns around cardinality.

This script assumes that library files do not contain spaces in their
path.
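The `/proc/*/maps` approach described above can be sketched roughly as follows. This is a minimal illustration, not the `processes.py` script itself: the function names and sample data are hypothetical, and it counts any deleted mapped file rather than filtering for libraries. It relies on the kernel appending ` (deleted)` to the pathname of an unlinked file, and, like the script, it assumes paths contain no spaces.

```python
import collections


def deleted_libraries(maps_text):
    """Return the set of deleted file paths mapped by one process.

    `maps_text` is the content of a /proc/<pid>/maps file. The kernel appends
    " (deleted)" to the pathname of a mapping whose file has been unlinked.
    """
    paths = set()
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode pathname ["(deleted)"]
        # Splitting on whitespace only works if the path has no spaces.
        fields = line.split()
        if len(fields) == 7 and fields[6] == "(deleted)":
            paths.add(fields[5])
    return paths


def count_processes(maps_by_pid):
    """Count, per deleted path, how many processes still map it."""
    counts = collections.Counter()
    for maps_text in maps_by_pid.values():
        # A set is used so that a process mapping the same file several
        # times is only counted once.
        counts.update(deleted_libraries(maps_text))
    return counts


# Hypothetical sample data standing in for two /proc/<pid>/maps files.
sample = {
    101: "7f1c2a000000-7f1c2a021000 r-xp 00000000 08:01 1234 /usr/lib/libfoo.so.1 (deleted)\n"
         "7f1c2a221000-7f1c2a222000 rw-p 00021000 08:01 1234 /usr/lib/libfoo.so.1 (deleted)\n",
    102: "7f9d10000000-7f9d10021000 r-xp 00000000 08:01 1234 /usr/lib/libfoo.so.1 (deleted)\n"
         "7f9d10400000-7f9d10500000 r-xp 00000000 08:01 5678 /usr/lib/libbar.so.2\n",
}
counts = count_processes(sample)
```

Here both sample processes map the deleted `libfoo.so.1`, so it is counted twice, while the still-present `libbar.so.2` is ignored.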

Signed-off-by: Matt Bostock <mbostock@cloudflare.com>
2018-05-25 18:20:42 +02:00
Sandor Zeestraten 578d814744 Fix metric name in directory size text collector example
The directory size text collector example uses the wrong metric name in the HELP and TYPE lines, rendering the comments unusable.

This fixes that by using the same metric name.

Signed-off-by: Sandor Zeestraten <sandor@zeestrataca.com>
2018-05-19 21:11:46 +02:00
mueller 770f420066 added additional smartmonattrs
Signed-off-by: mueller <mueller@b1-systems.de>
2018-03-22 11:14:25 +01:00
Ben Kochie 483f59d110 Document use of atomic wrapper (#781)
Document how to use `sponge` to atomically update textfiles.
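
For scripts written in Python, the same goal can be approached with a write-then-rename pattern. The sketch below illustrates that idea; it is not how `sponge` is implemented. The function name, the `.tmp` suffix, and the 0644 file mode are assumptions made for the example.

```python
import os
import tempfile


def write_atomically(path, content):
    """Write `content` to `path` so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the final rename
    # stays on one filesystem, which is what makes it atomic on POSIX.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(content)
        # mkstemp creates the file with mode 0600; widening it to 0644 is an
        # assumption so that a collector running as another user can read it.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise


# Demonstration in a throwaway directory.
demo_dir = tempfile.mkdtemp()
target = os.path.join(demo_dir, "example.prom")
write_atomically(target, "example_metric 1\n")
```

A reader of `example.prom` sees either the old file or the complete new one, never a half-written file.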
2018-02-27 19:46:01 +01:00
anarcat 79ae03c4c7 add sample directory size exporter (#789)
* add sample directory size exporter

This is a possible workaround for the lack of metrics in the new
storage backend, as documented in:

https://github.com/prometheus/prometheus/issues/3684

Partly inspired by this post as well:

https://www.robustperception.io/monitoring-directory-sizes-with-the-textfile-collector/

* properly escape backslashes and double-quotes
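
The escaping mentioned in the last bullet can be sketched in Python as below. In the Prometheus text format, a backslash, a double quote, and a newline inside a label value are written as `\\`, `\"`, and `\n`. The function names and the `node_directory_size_bytes` metric name are assumptions for illustration and are not taken from this commit.

```python
def escape_label_value(value):
    """Escape a string for use inside a double-quoted label value."""
    # Backslashes must be escaped first so the backslashes added for quotes
    # and newlines are not escaped a second time.
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def directory_size_line(path, size_bytes):
    """Format one sample line; the metric name here is hypothetical."""
    return 'node_directory_size_bytes{directory="%s"} %d' % (
        escape_label_value(path),
        size_bytes,
    )


# A directory name containing both a double quote and a backslash.
line = directory_size_line('/srv/odd"name\\dir', 4096)
```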
2018-02-21 16:24:48 +01:00
tobald 2978728b00 Fix apt.sh syntax (#811)
This patch fixes:

./apt.test: command substitution: line 19: syntax error near unexpected token `|'
./apt.test: command substitution: line 19: `  | /usr/bin/sort   | /usr/bin/uniq -c   | awk '{ gsub(/\\\\/,
2018-02-05 20:43:25 +01:00
Shevchenko Vitaliy 4ed49e73fb Escape double quotes in device model family (#772) 2018-01-24 11:35:14 +01:00
Ben Kochie 1ad5ba4dc7 Fix smartmon.sh bugs (#792)
* Fix smartmon.sh info label consistency.

* Fix parsing of SMART-ID attributes <= 99.
2018-01-22 16:51:20 +01:00
Bruce Lee 8d3484d0ca Update storcli.py (#783) 2018-01-09 09:10:30 +01:00
Mario Trangoni a40f7e78da StorCli text collector: fix pylint issues and handle StorCli not installed (#758)
* StorCli text collector: fix pylint issues and handle StorCli not installed

* StorCli text collector: Add HELP and TYPE strings.
2017-12-12 18:48:06 +01:00
Filippo Giunchedi af4cf20b46 apt.sh: handle multiple origins in apt-get output (#757)
It might happen that a given upgrade comes from multiple origins, in
which case the origins are separated by ", ", which breaks the
whitespace-based split. For example:

Inst package [1.2.3] (1.2.4 Debian:8.10/oldstable, Debian-Security:8/oldstable [amd64])

To work around this case, mangle the apt-get output to remove whitespace from
the origins list.
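
The mangling step can be illustrated in Python using the example line above. This is only a sketch of the idea, since `apt.sh` itself is a shell script; the function name is hypothetical, and it assumes the parenthesised part always has the shape "version origins [arch]".

```python
import re


def parse_inst_line(line):
    """Return (origins, arch) for one "Inst ..." line of apt-get output."""
    # Collapse ", " inside the parenthesised part so that the origins list
    # becomes a single whitespace-free token.
    details = re.search(r"\((.*)\)", line).group(1).replace(", ", ",")
    # After mangling: "<new version> <origins> [<arch>]"
    _version, origins, arch = details.split()
    return origins, arch.strip("[]")


line = ("Inst package [1.2.3] (1.2.4 Debian:8.10/oldstable, "
        "Debian-Security:8/oldstable [amd64])")
origins, arch = parse_inst_line(line)
```

Without the `replace(", ", ",")` step, the split would yield four tokens instead of three and the architecture would land in the wrong position.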
2017-12-12 10:45:59 +01:00
Derek Marcotte 1527789f76 Added text collector conversion for ipmitool output. (#746)
* Added text collector conversion for ipmitool output.

* Sort metrics before exporting, add namespace.

* Added HELP string, tidy up a bit.

* Make status a gauge.
2017-12-01 12:58:39 +01:00
William 6ecd8780d9 added Wear_Leveling_Count attribute to smartmon.sh script (#707) 2017-10-19 19:20:43 +02:00
Ben Kochie 1824ac3b9e Fix smartmon.sh textfile script (#700)
When there are no SMART compatible devices (Raspberry Pi for example) an
error is returned, but the return code is still 0.

`# scan_smart_devices: glob(3) aborted matching pattern /dev/discs/disc*`

* Remove unused `disks` variable.
* Filter for only valid `/dev` devices.
2017-10-18 07:37:47 +02:00
Ben Kochie a47f033f1b Add text file helper for apt-get. (#680)
* Add metric for pending upgrades.
* Add metric for pending reboot required.
2017-10-04 08:34:30 +02:00
Matt Bostock 89a2f21f45 Always try to return smartmon_device_info metric (#663)
* Always try to return smartmon_device_info metric

Sometimes the 'model family' field is not returned by `smartctl` because
a disk is not in the disk database for the version of smartmontools
installed on the system.

In those cases, the device model and serial number are still returned (at
least as far as I have observed).

Re-work the logic to prefer the 'vendor' field first, and if not
present, always output a `smartmon_device_info` metric even if some
labels have empty values.
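
The "always output the metric, even with empty labels" behaviour can be sketched in Python as follows. It is an illustration only: `smartmon.sh` is a shell script, the function and constant names are hypothetical, the preference for the 'vendor' field is not modelled, and label values are not escaped.

```python
INFO_LABELS = ("model_family", "device_model", "serial_number", "firmware_version")


def device_info_line(disk, disk_type, info):
    """Build a smartmon_device_info line, keeping labels that have no value."""
    labels = [("disk", disk), ("type", disk_type)]
    # dict.get with a default of "" keeps the label set stable even when
    # smartctl did not report a field such as the model family.
    labels += [(name, info.get(name, "")) for name in INFO_LABELS]
    body = ",".join('%s="%s"' % pair for pair in labels)
    return "smartmon_device_info{%s} 1" % body


# Placeholder values; only two of the four info fields are known here.
line = device_info_line(
    "/dev/sda",
    "sat",
    {"device_model": "EXAMPLE MODEL", "serial_number": "EXAMPLE123"},
)
```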

On the box I'm testing this on, where previously no metric was returned,
it now returns:

    # HELP smartmon_device_info SMART metric device_info
    # TYPE smartmon_device_info gauge
    smartmon_device_info{disk="/dev/sda",type="sat",model_family="",device_model="INTEL REDACTED",serial_number="REDACTED",firmware_version="REDACTED"} 1
    smartmon_device_info{disk="/dev/sdb",type="sat",model_family="",device_model="INTEL REDACTED",serial_number="REDACTED",firmware_version="REDACTED"} 1
    smartmon_device_info{disk="/dev/sdc",type="sat",model_family="",device_model="INTEL REDACTED",serial_number="REDACTED",firmware_version="REDACTED"} 1
    smartmon_device_info{disk="/dev/sdd",type="sat",model_family="",device_model="INTEL REDACTED",serial_number="REDACTED",firmware_version="REDACTED"} 1
    smartmon_device_info{disk="/dev/sde",type="sat",model_family="",device_model="INTEL REDACTED",serial_number="REDACTED",firmware_version="REDACTED"} 1
    smartmon_device_info{disk="/dev/sdf",type="sat",model_family="",device_model="INTEL REDACTED",serial_number="REDACTED",firmware_version="REDACTED"} 1

* Add trailing newline

Because POSIX:
https://stackoverflow.com/a/729795
2017-08-31 18:00:42 +02:00
William Cooley 977aa94bd3 Added metric for overall health status check to smartmon.sh example script 2017-04-05 10:51:58 -04:00
Rene Treffer d61fef8ce6 Handle smart raw values >2^31
"%d" in awk will truncate values at 2^31. S.M.A.R.T. values can exceed that, thus use a floating point notation instead to encode larger values (at the possible cost of some precision).
2017-03-21 10:47:27 +01:00
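The precision trade-off mentioned in the commit above can be seen with Python's printf-style formatting. The `%e` format is an assumption chosen for illustration, since the commit message does not name the exact format string that was used.

```python
raw_value = 2**32  # a raw SMART value larger than 2^31

# Exponent notation keeps the magnitude but only six decimal digits.
as_float = "%e" % raw_value

# Reading the formatted text back shows how much precision was lost.
round_trip = int(float(as_float))
lost = raw_value - round_trip
```

Here `as_float` is `4.294967e+09`, so the value read back is 4294967000 and the last 296 units of the original are lost.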
Ben Kochie 58c10628d8 Add ntpd metrics from ntpq rv
Add some metrics to the ntpd helper script using the "request
value"[0] command.

[0]: https://www.eecis.udel.edu/~mills/ntp/html/ntpq.html#system
2017-02-14 16:20:53 +01:00
Ben Kochie bde6e5d290 Add a textfile helper for NTPd.
Parse the output of `ntpq -np` to provide metrics from a local NTP
daemon.
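
As a rough illustration of what parsing `ntpq -np` involves, the sketch below reads the peer table into dictionaries. It is not the helper script from this commit: the function name and the sample output are made up, and it assumes the usual ten-column layout (remote, refid, st, t, when, poll, reach, delay, offset, jitter) with a one-character tally code in front of each row.

```python
def parse_ntpq_peers(output):
    """Parse peer rows from `ntpq -np`-style output into dictionaries."""
    peers = []
    for line in output.splitlines()[2:]:  # skip the header and "====" rule
        if not line.strip():
            continue
        # The first column is a one-character tally code ("*" marks the
        # peer currently selected for synchronisation, " " means none).
        tally, fields = line[0], line[1:].split()
        peers.append({
            "remote": fields[0],
            "selected": tally == "*",
            "stratum": int(fields[2]),
            "delay_ms": float(fields[7]),
            "offset_ms": float(fields[8]),
            "jitter_ms": float(fields[9]),
        })
    return peers


# Made-up sample output using documentation-range IP addresses.
SAMPLE = """\
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*192.0.2.1       .GPS.            1 u   33   64  377    0.512   -0.021   0.034
+192.0.2.2       192.0.2.1        2 u   12   64  377    1.204    0.310   0.120
"""
peers = parse_ntpq_peers(SAMPLE)
```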
2017-02-10 16:38:39 +01:00
Matt Bostock 004bdca8e5 Add text_collector_examples README 2016-12-22 22:57:14 +00:00
Matt Bostock 2c02571040 Add StorCli text collector example script
Collect metrics from the StorCLI utility on the health of MegaRAID
hardware RAID controllers and write them to stdout so that they can be
used by the textfile collector.

We parse the JSON output that StorCLI provides.

Script must be run as root or with appropriate capabilities for storcli
to access the RAID card.

Designed to run under Python 2.7, using the system Python provided with
many Linux distributions.

The metrics look like this:

    mbostock@host:~$ sudo ./storcli.py
    megaraid_status_code 0
    megaraid_controllers_count 1
    megaraid_emergency_hot_spare{controller="0"} 1
    megaraid_scheduled_patrol_read{controller="0"} 1
    megaraid_virtual_drives{controller="0"} 1
    megaraid_drive_groups{controller="0"} 1
    megaraid_virtual_drives_optimal{controller="0"} 1
    megaraid_degraded{controller="0"} 0
    megaraid_battery_backup_healthy{controller="0"} 1
    megaraid_ports{controller="0"} 8
    megaraid_failed{controller="0"} 0
    megaraid_drive_groups_optimal{controller="0"} 1
    megaraid_healthy{controller="0"} 1
    megaraid_physical_drives{controller="0"} 24
    megaraid_controller_info{controller="0", model="AVAGOMegaRAIDSASPCIExpressROMB"} 1
    mbostock@host:~$
2016-12-22 22:55:58 +00:00
Ben Kochie 0d2314e2b4 Add text file utility for SMART metrics
Add a utility to parse the output of `smartctl`.
* Scans all disks.
* Prints metrics for `smartctl --info`.
* Prints metrics for `smartctl --attributes`.
2016-11-27 14:32:32 +01:00