Add this new metric (where sda is active and sdb is in standby mode):
smartmon_device_active{disk="/dev/sda",type="sat"} 1
smartmon_device_active{disk="/dev/sdb",type="sat"} 0
Also skip further metrics if the drive is in a low-power mode. This
prevents spinning up disks just to get the metrics (which matches, e.g.,
Debian's default behavior for smartd).
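A minimal Python sketch of the idea (the shipped collector is a shell
script); it assumes smartctl's default `-n standby` behaviour of exiting
with status 2 when the drive is in a low-power mode:
```python
#!/usr/bin/env python3
# Sketch only: emit smartmon_device_active and skip sleeping disks.
import subprocess

def device_is_active(disk, dtype):
    # `-n standby` tells smartctl not to wake a spun-down drive;
    # by default it then exits with status 2.
    rc = subprocess.call(
        ["smartctl", "-n", "standby", "-i", "-d", dtype, disk],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return rc != 2

for disk, dtype in (("/dev/sda", "sat"), ("/dev/sdb", "sat")):
    active = device_is_active(disk, dtype)
    print('smartmon_device_active{{disk="{0}",type="{1}"}} {2}'.format(
        disk, dtype, int(active)))
    if not active:
        continue  # don't spin up the disk just to read attributes
    # ... emit the remaining smartmon_* metrics for this disk ...
```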
Signed-off-by: Andre Heider <a.heider@gmail.com>
We use the output-compatible perccli, and storcli.py does not handle 'Unknown' as a result value:
```
sg="Error parsing \"/var/lib/node_exporter/perccli.prom\": text format parsing error in line 222: expected float as value, got \"Unknown\"" source="textfile.go:212"
```
I know perccli should not return 'Unknown', but this error breaks all other useful measurements because the .prom file is not parsable. The added if condition fixes this.
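A sketch of such a guard, with a hypothetical emit() helper and an
illustrative metric name (not the script's actual functions):
```python
# Sketch: refuse to emit a sample whose value is not a float, so a
# single 'Unknown' cannot make the whole .prom file unparsable.
def emit(name, labels, value):
    try:
        value = float(value)
    except ValueError:
        return  # silently drop values such as 'Unknown'
    print('{0}{{{1}}} {2}'.format(name, labels, value))

emit('megaraid_bbu_temperature', 'controller="0"', 'Unknown')  # dropped
emit('megaraid_bbu_temperature', 'controller="0"', '28')       # printed
```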
Signed-off-by: Andreas Wirooks <andreas.wirooks@1und1.de>
* storcli.py: Remove IntEnum
This removes an external dependency.
Moved VD state to VD info labels
* storcli.py: Fix BBU health detection
BBU Status is 0 for a healthy cache vault and 32 for a healthy BBU.
* storcli.py: Strip all strings from PD
Strip all strings that we get from PDs.
They often contain whitespace...
* storcli.py: Add formatting options
Add help text explaining how this document was formatted
* storcli.py: Add DG to pd_info label
Add disk group to pd_info.
That way we can relate to PDs in the same DG.
For example to check if all disks in one RAID
use the same interface...
* storcli.py: Fix promtool issues
Fix linting issues reported by promtool check-metrics
* storcli.py: Exit if storcli reports issues
storcli reports if the command was a success.
We should not continue if there are issues.
* storcli.py: Try to parse metrics to float
This sanitizes the values we hand over to
node_exporter, eliminating any unforeseen values we read out...
* storcli.py: Refactor code to implement handle_sas_controller()
Move code into methods so that we can now also support HBA queries.
* storcli.py: Sort inputs
"...like a good python developer"
- Daniel Swarbrick
* storcli.py: Replace external dateutil library with internal datetime
Removes an external dependency.
* storcli.py: Also collect temperature on megaraid cards
We already collect them on mpt3sas cards...
* storcli.py: Clean up old code
Removed dead code that is no longer used.
* storcli.py: strip() all information for labels
They often contain whitespace...
* storcli.py: Try to catch KeyErrors generally
If a key we expect is not there, we still want to
print whatever we have collected so far (see the sketch after this list)...
* storcli.py: Increment version number
We have made some changes here and there.
The general look of the data has not been changed.
* storcli.py: Fix CodeSpell issue
Split the string to avoid issues with Codespell due to 'Celcius' in a JSON key
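A sketch of the KeyError guard referenced above, with illustrative
metric names and JSON layout:
```python
# Sketch: gather samples first, then flush whatever we collected even
# if a key we expected is missing from storcli's JSON.
metrics = []  # (name, labels, value)

def collect(data):
    ctl = data["Controllers"][0]["Response Data"]
    metrics.append(("megaraid_controllers_count", 'controller="0"',
                    len(data["Controllers"])))
    metrics.append(("megaraid_ports", 'controller="0"', ctl["Ports"]))

try:
    collect({"Controllers": [{"Response Data": {}}]})  # "Ports" missing
except KeyError as err:
    print("# skipped a metric, missing key: {0}".format(err))
finally:
    for name, labels, value in metrics:
        print("{0}{{{1}}} {2}".format(name, labels, value))
```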
Signed-off-by: Christopher Blum <zeichenanonym@web.de>
* deleted_libraries: Upgrade to Python 3
Python 2.7 will not be maintained past 2020. Therefore upgrade
text_collector_examples/deleted_libraries.py to Python 3.
* Add mellanox_hca_temp text collector example
mellanox_hca_temp is a script that reads Mellanox HCA temperature using
the Mellanox mget_temp_ext tool.
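A rough sketch of the approach; the mget_temp_ext flags and the metric
name here are assumptions, so consult the MFT documentation for your
version:
```python
#!/usr/bin/env python3
# Sketch: read each HCA's temperature via Mellanox's mget_temp_ext.
import glob
import os
import subprocess

print("# HELP node_hca_temperature_celsius HCA temperature in Celsius")
print("# TYPE node_hca_temperature_celsius gauge")
for path in glob.glob("/sys/class/infiniband/*"):
    hca = os.path.basename(path)
    # Assumed invocation; mget_temp_ext ships with Mellanox MFT.
    out = subprocess.check_output(["mget_temp_ext", "-d", hca])
    print('node_hca_temperature_celsius{{hca="{0}"}} {1}'.format(
        hca, float(out.strip())))
```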
Signed-off-by: Benjamin Drung <benjamin.drung@cloud.ionos.com>
* textfile smartmon.sh
Added functions to also parse megaraid disks.
Added parsing to also detect the grown_defects counters.
* textfile storcli.py
Reworked the example file to export lots more information about
megaraid attached controllers, VDs and PDs.
Signed-off-by: Christopher Blum <christopher.blum@profitbricks.com>
Add metrics that expose more information about MD RAID devices and
disks:
- the RAID level in use
- the RAID set that a disk belongs to
This allows for things like alerting on unusually high I/O
utilisation for a disk compared to other disks in the same RAID set,
which usually means the disk is failing, and for comparing
write/read latency across RAID sets.
Output looks like:
node_md_disk_info{disk_device="/dev/dm-0", md_device="md1", md_set="A"} 1
node_md_disk_info{disk_device="/dev/dm-3", md_device="md1", md_set="B"} 1
node_md_disk_info{disk_device="/dev/dm-2", md_device="md1", md_set="A"} 1
node_md_disk_info{disk_device="/dev/dm-1", md_device="md1", md_set="B"} 1
node_md_disk_info{disk_device="/dev/dm-4", md_device="md1", md_set="A"} 1
node_md_disk_info{disk_device="/dev/dm-5", md_device="md1", md_set="B"} 1
node_md_info{md_device="md1", md_name="foo", raid_level="10", md_metadata_version="1.2"} 1
The `node_md_info` metric, which gives additional information about the
RAID array, is intentionally separate to avoid adding all of those
labels to each disk. If you need to query using the labels contained in
`node_md_info`, you can do that using PromQL:
https://www.robustperception.io/how-to-have-labels-for-machine-roles/
I looked at adding the array UUID, but there's no sysfs entry for it and
I'm not sure there's a strong use case for it.
This patch to add a sysfs entry for the UUID was apparently not
accepted:
https://www.spinics.net/lists/raid/msg40667.html
Add these metrics as a textfile script rather than adding them to the Go
'md' module as they're perhaps less commonly useful. If lots of people
find them useful, we can later rewrite this in Go.
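A Python sketch of where most of this lives in sysfs (the shipped
collector is a shell script; deriving the md_set letter from slot
parity is an assumption for illustration, and md_name is omitted
because it is not exposed under /sys/block):
```python
#!/usr/bin/env python3
# Sketch: read RAID level, metadata version and per-disk slots from sysfs.
import glob
import os
import string

for md in glob.glob("/sys/block/md[0-9]*"):
    md_device = os.path.basename(md)
    with open(os.path.join(md, "md", "level")) as f:
        level = f.read().strip()              # e.g. "raid10"
    with open(os.path.join(md, "md", "metadata_version")) as f:
        metadata = f.read().strip()           # e.g. "1.2"
    print('node_md_info{{md_device="{0}", raid_level="{1}", '
          'md_metadata_version="{2}"}} 1'.format(md_device, level, metadata))
    for dev in glob.glob(os.path.join(md, "md", "dev-*")):
        disk = os.path.basename(dev)[len("dev-"):]
        with open(os.path.join(dev, "slot")) as f:
            slot = f.read().strip()
        if not slot.isdigit():
            continue  # spare or faulty member without a slot
        # Illustrative only: even/odd slots as mirror sets "A"/"B".
        md_set = string.ascii_uppercase[int(slot) % 2]
        print('node_md_disk_info{{disk_device="/dev/{0}", md_device="{1}", '
              'md_set="{2}"}} 1'.format(disk, md_device, md_set))
```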
Signed-off-by: Matt Bostock <mbostock@cloudflare.com>
Add metrics that count how many running processes are linking to deleted
libraries on each machine. Deleted libraries are usually outdated
libraries, and outdated libraries may have known security
vulnerabilities.
The rationale behind storing these as metrics is to allow the rollout of
security fixes to be tracked across a fleet of machines, ensuring that
all affected processes are restarted (e.g. via a reboot).
I'm parsing the output from `/proc/*/maps` because using `lsof -d
DEL` can be too slow, particularly if you have sockets that bind to
thousands of IP addresses.
The metric labels include the library path and the base filename, which
allows us to pinpoint the exact path of the deleted library but also
allows us to aggregate on the library name (or approximations of it)
even if library locations differ between operating system versions.
The metrics output and the CPU time consumed are as follows:
user@host:~$ time sudo python processes.py
# HELP node_processes_linking_deleted_libraries Count of running processes that link a deleted library
# TYPE node_processes_linking_deleted_libraries gauge
node_processes_linking_deleted_libraries{library_path="locale-archive", library_name="/usr/lib/locale"} 3
node_processes_linking_deleted_libraries{library_path="libevent-2.0.so.5.1.9", library_name="/usr/lib/x86_64-linux-gnu"} 4
real 0m0.071s
user 0m0.030s
sys 0m0.041s
Including the library filename and path will result in reasonably high
metrics cardinality; however, I think the benefits when an urgent
security patch is being deployed outweigh concerns around cardinality.
This script assumes that library files do not contain spaces in their
path.
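A condensed sketch of the approach (the real processes.py differs in
details such as which paths count as libraries):
```python
#!/usr/bin/env python3
# Sketch: count processes whose maps contain "(deleted)" library entries.
import collections
import glob
import os

counts = collections.Counter()
for maps in glob.glob("/proc/[0-9]*/maps"):
    seen = set()  # count each library at most once per process
    try:
        with open(maps) as f:
            for line in f:
                parts = line.split()
                # Fields: address perms offset dev inode path ["(deleted)"]
                # Assumes the library path contains no spaces.
                if len(parts) >= 7 and parts[-1] == "(deleted)":
                    path = parts[-2]
                    if ".so" in path or "/lib/" in path:  # crude filter
                        seen.add((os.path.dirname(path),
                                  os.path.basename(path)))
    except OSError:
        continue  # the process exited while we were reading
    for key in seen:
        counts[key] += 1

print("# HELP node_processes_linking_deleted_libraries "
      "Count of running processes that link a deleted library")
print("# TYPE node_processes_linking_deleted_libraries gauge")
for (library_path, library_name), n in sorted(counts.items()):
    print('node_processes_linking_deleted_libraries{{library_path="{0}", '
          'library_name="{1}"}} {2}'.format(library_path, library_name, n))
```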
Signed-off-by: Matt Bostock <mbostock@cloudflare.com>
The directory size text collector example uses the wrong metric name in the HELP and TYPE lines, rendering the comments unusable.
This fixes that by using the same metric name.
Signed-off-by: Sandor Zeestraten <sandor@zeestrataca.com>
It might happen that a given upgrade comes from multiple origins, in
which case the origins are separated by ", ", breaking the
whitespace-based split. For example:
Inst package [1.2.3] (1.2.4 Debian:8.10/oldstable, Debian-Security:8/oldstable [amd64])
To work around this case, mangle the apt-get output to remove whitespace
from the origins list.
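A sketch of the mangling step (the shipped collector is a shell script;
the substitution is the essence of the workaround):
```python
# Sketch: collapse ", " between origins so whitespace-splitting keeps
# the whole parenthesised origins list in one field.
import re

line = ('Inst package [1.2.3] '
        '(1.2.4 Debian:8.10/oldstable, Debian-Security:8/oldstable [amd64])')
mangled = re.sub(r', +', ',', line)
print(mangled.split())
# ['Inst', 'package', '[1.2.3]', '(1.2.4',
#  'Debian:8.10/oldstable,Debian-Security:8/oldstable', '[amd64])']
```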
* Added text collector conversion for ipmitool output.
* Sort metrics before exporting, add namespace.
* Added HELP string, tidy up a bit.
* Make status a gauge.
When there are no SMART-compatible devices (on a Raspberry Pi, for
example) an error is returned, but the return code is still 0.
`# scan_smart_devices: glob(3) aborted matching pattern /dev/discs/disc*`
* Remove unused `disks` variable.
* Filter for only valid `/dev` devices.
* Always try to return smartmon_device_info metric
Sometimes the 'model family' field is not returned by `smartctl` because
a disk is not in the disk database for the version of smartmontools
installed on the system.
In those cases, the device model and serial number are still returned (at
least as far as I have observed).
Re-work the logic to prefer the 'vendor' field first and, if not
present, always output a `smartmon_device_info` metric even if some
labels have empty values (see the sketch after this list).
On the box I'm testing this on, where previously no metric was returned,
it now returns:
# HELP smartmon_device_info SMART metric device_info
# TYPE smartmon_device_info gauge
smartmon_device_info{disk="/dev/sda",type="sat",model_family="",device_model="INTEL REDACTED",serial_number="REDACTED",firmware_version="REDACTED"} 1
smartmon_device_info{disk="/dev/sdb",type="sat",model_family="",device_model="INTEL REDACTED",serial_number="REDACTED",firmware_version="REDACTED"} 1
smartmon_device_info{disk="/dev/sdc",type="sat",model_family="",device_model="INTEL REDACTED",serial_number="REDACTED",firmware_version="REDACTED"} 1
smartmon_device_info{disk="/dev/sdd",type="sat",model_family="",device_model="INTEL REDACTED",serial_number="REDACTED",firmware_version="REDACTED"} 1
smartmon_device_info{disk="/dev/sde",type="sat",model_family="",device_model="INTEL REDACTED",serial_number="REDACTED",firmware_version="REDACTED"} 1
smartmon_device_info{disk="/dev/sdf",type="sat",model_family="",device_model="INTEL REDACTED",serial_number="REDACTED",firmware_version="REDACTED"} 1
* Add trailing newline
Because POSIX:
https://stackoverflow.com/a/729795
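A sketch of the reworked device-info logic in Python (the collector
itself is a shell/awk script; field and label names follow the output
shown above):
```python
# Sketch: prefer 'vendor'; otherwise always emit smartmon_device_info,
# leaving labels empty instead of dropping the metric entirely.
def device_info_metric(disk, dtype, info):
    # `info` maps lower-cased `smartctl --info` field names to values.
    if info.get("vendor"):
        labels = ('disk="{0}",type="{1}",vendor="{2}",product="{3}"'.format(
            disk, dtype, info["vendor"], info.get("product", "")))
    else:
        labels = ('disk="{0}",type="{1}",model_family="{2}",'
                  'device_model="{3}",serial_number="{4}",'
                  'firmware_version="{5}"'.format(
                      disk, dtype,
                      info.get("model family", ""),
                      info.get("device model", ""),
                      info.get("serial number", ""),
                      info.get("firmware version", "")))
    return "smartmon_device_info{{{0}}} 1".format(labels)

# A disk that is missing 'model family' still yields a metric:
print(device_info_metric("/dev/sda", "sat",
                         {"device model": "INTEL REDACTED",
                          "serial number": "REDACTED",
                          "firmware version": "REDACTED"}))
```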
"%d" in awk will truncate values at 2^31. S.M.A.R.T. values can exceed that, thus use a floating point notation instead to encode larger values (at the possible cost of some precision).
Collect metrics from the StorCLI utility on the health of MegaRAID
hardware RAID controllers and write them to stdout so that they can be
used by the textfile collector.
We parse the JSON output that StorCLI provides.
Script must be run as root or with appropriate capabilities for storcli
to access the RAID card.
Designed to run under Python 2.7, using the system Python provided with
many Linux distributions.
The metrics look like this:
mbostock@host:~$ sudo ./storcli.py
megaraid_status_code 0
megaraid_controllers_count 1
megaraid_emergency_hot_spare{controller="0"} 1
megaraid_scheduled_patrol_read{controller="0"} 1
megaraid_virtual_drives{controller="0"} 1
megaraid_drive_groups{controller="0"} 1
megaraid_virtual_drives_optimal{controller="0"} 1
megaraid_degraded{controller="0"} 0
megaraid_battery_backup_healthy{controller="0"} 1
megaraid_ports{controller="0"} 8
megaraid_failed{controller="0"} 0
megaraid_drive_groups_optimal{controller="0"} 1
megaraid_healthy{controller="0"} 1
megaraid_physical_drives{controller="0"} 24
megaraid_controller_info{controller="0", model="AVAGOMegaRAIDSASPCIExpressROMB"} 1
mbostock@host:~$
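A condensed sketch of the flow, using storcli's JSON schema as observed
(keys may differ between storcli versions):
```python
#!/usr/bin/env python3
# Sketch: query all controllers as JSON, bail out on storcli errors,
# then emit a couple of the gauges shown above.
import json
import subprocess

out = subprocess.check_output(["storcli", "/call", "show", "all", "J"])
data = json.loads(out.decode("utf-8"))

print("megaraid_controllers_count {0}".format(len(data["Controllers"])))
for ctl in data["Controllers"]:
    status = ctl["Command Status"]
    if status["Status"] != "Success":
        # Exit if storcli reports issues instead of emitting bogus data.
        raise SystemExit("storcli: {0}".format(status["Status"]))
    cid = status["Controller"]
    response = ctl["Response Data"]
    healthy = response["Status"]["Controller Status"] == "Optimal"
    print('megaraid_healthy{{controller="{0}"}} {1}'.format(cid, int(healthy)))
```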
Add a utility to parse the output of `smartctl`.
* Scans all disks.
* Prints metrics for `smartctl --info`.
* Prints metrics for `smartctl --attributes`.