131 commits

Author SHA1 Message Date
Brian Brazil 31ce32f1fe
Greatly trim what netstat collector exposes by default (#876)
Netstat is 40% of the metrics on my laptop, many of which
are highly detailed information about IP internals in the kernel.
~300 such metrics on every machine in your fleet is excessive,
so focus on key metrics by default, overridable by the user.

Fixes #515

Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
2018-03-30 19:28:08 +01:00
Ben Kochie cf3edadcbb Update fixtures
* Add oom_kill to fixture.
* Update e2e outputs.
* Put regexp in order.

Signed-off-by: Ben Kochie <superq@gmail.com>
2018-03-29 22:00:02 +01:00
Brian Brazil 499c342fed Greatly reduce the metrics vmstat returns by default.
Vmstat has over 100 fields, most of which are highly
detailed debug information. Trim this down to only
essential fields by default, configurable by flag.

Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
2018-03-29 22:00:02 +01:00
Ben Kochie 779090db7e
Update ppc64le fixture (#867)
Update to match standard e2e output.

Signed-off-by: Ben Kochie <superq@gmail.com>
2018-03-27 17:05:20 +02:00
Mario Trangoni 1f11a86d59 Fix nfs golint issues (#863)
* procfs: update vendoring

Signed-off-by: Mario Trangoni <mjtrangoni@gmail.com>

* procfs: fix e2e tests after nfs changes

Signed-off-by: Mario Trangoni <mjtrangoni@gmail.com>
2018-03-22 22:25:37 +01:00
Ben Kochie 7b720df1c5
Use lowercase cpu label name in interrupts (#849)
To match other CPU related metric labels, use a lowercase named label.
2018-03-08 15:04:49 +01:00
Julius Volz 864a6ee935 Treat custom textfile metric timestamps as errors (#769)
This is clearer behavior and users will notice and fix their textfiles faster
than if we just output a warning.
2018-02-27 19:43:38 +01:00
Rene Treffer c504c7e264 Only report core throttles per core, not per cpu (#836)
* Only report core throttles per core, not per cpu

* Add topology/core_id to the cpu sysfs fixtures

* Add new cpu fixtures to ttar file

* Merge core_id reading and thermal throttle accounting

* Declare core_id
2018-02-27 19:43:15 +01:00
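
The per-core reporting in c504c7e264 can be sketched as follows: logical CPUs that share a `core_id` would otherwise each report the same core throttle count. This is an illustrative sketch only. The `cpuInfo` type and `throttlesPerCore` function are hypothetical, the data is in memory rather than read from sysfs, and keying on `core_id` alone assumes a single package (a multi-socket system would also need the package id in the key).

```go
package main

import "fmt"

// cpuInfo is a hypothetical stand-in for values a collector would read
// from each logical CPU's sysfs directory.
type cpuInfo struct {
	CoreID        int    // from topology/core_id
	CoreThrottles uint64 // from thermal_throttle/core_throttle_count
}

// throttlesPerCore keeps one throttle count per core_id, skipping sibling
// logical CPUs that belong to a core that has already been recorded.
func throttlesPerCore(cpus []cpuInfo) map[int]uint64 {
	perCore := make(map[int]uint64)
	for _, c := range cpus {
		if _, seen := perCore[c.CoreID]; seen {
			continue
		}
		perCore[c.CoreID] = c.CoreThrottles
	}
	return perCore
}

func main() {
	// Four logical CPUs on two cores (two hardware threads per core).
	cpus := []cpuInfo{
		{CoreID: 0, CoreThrottles: 5},
		{CoreID: 1, CoreThrottles: 2},
		{CoreID: 0, CoreThrottles: 5},
		{CoreID: 1, CoreThrottles: 2},
	}
	for core, n := range throttlesPerCore(cpus) {
		fmt.Printf("core=%d throttles=%d\n", core, n)
	}
}
```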
Ben Kochie e0d54a509c
Cleanup NFS metrics (#834)
* Cleanup NFS metrics

* Update `nfs` metric names to match `nfsd`.
* Remove unneeded `tcp` label from TCP connections metric.
* Remove unneeded `v` on `nfsd` metrics.
* Enable all `nfs` v4 client metrics.
* Remove `nfs` metric name overrides.

* Add ppc64le fixture.

* Fix typo.
2018-02-21 07:25:41 +01:00
Ben Kochie 3f41a2fecb
Update ppc64le fixture (#832)
Updates fixture for ppc64le arch to latest output.
2018-02-19 20:43:33 +01:00
Ben Kochie d33a447047
Remove deprecated prometheus.InstrumentHandlerFunc (#831)
Update the Prometheus Go client usage to call `promhttp.Handler()` instead
of `prometheus.InstrumentHandlerFunc()`.
2018-02-19 15:44:59 +01:00
Richard Elling d7348a5c78 updates for zfsonlinux 0.7.5 (#779)
* updates for zfsonlinux 0.7.5

* add constants for KSTAT_DATA_* types

* added e2e test for negative values represented by uint64 that can result from ZFS bugs
2018-02-16 15:46:31 +01:00
Ben Kochie 3de2542d21
Fix NFSd metric type (#819)
RPC Count should be a counter, not a gauge.
2018-02-13 17:03:22 +01:00
Matt Layher 544488ddd6 Fix remaining metric naming issues (#799) 2018-02-12 18:53:31 +01:00
Ben Kochie 6a041692ed
Add NFS Server metrics collector. (#803)
* Add NFS Server metrics collector.

* Add File Handles metrics.

* Add nfsd IO stats.

* Add metrics for NFSd threads.

* Add metrics for NFSd read ahead cache.

* Add NFSd network traffic counters.

* Add RPC metrics.

* Add V2 requests metrics.

* Add NFSv3 metrics.

* Add NFSv4 metrics.

* Update reply cache comment.

* Update help text.
2018-02-12 17:56:05 +01:00
Ben Kochie 14d60958d6
Unify CPU collector conventions (#806)
* Unify CPU collector conventions

Add a common CPU metric description.
* All collectors use the same `nodeCpuSecondsDesc`.
* All collectors drop the `cpu` prefix for `cpu` label values.

* Fix subsystem string in cpu_freebsd.

* Fix Linux CPU freq label names.
2018-02-01 18:42:20 +01:00
Ben Kochie 111e3af437
Remove obsolete megacli collector. (#798)
This collector has been replaced by the textfile collector tool
`storcli.py`.
2018-01-23 11:25:42 +01:00
Julius Volz 6cac74f0e0
Add unit suffix to textfile collector mtime metric (#796) 2018-01-22 14:02:19 +01:00
Brian Brazil a98067a294 Make metrics better follow guidelines (#787)
* Improve stat linux metric names.

cpu is no longer used.

* node_cpu -> node_cpu_seconds_total for Linux

* Improve filesystem metric names with units

* Improve units and names of linux disk stats

Remove sector metrics, the bytes metrics cover those already.

* Infiniband counters should end in _total

* Improve timex metric names, convert to more normal units.

See
3c073991eb/kernel/time/ntp.c (L909)
for what stabil means, looks like a moving average of some form.

* Update test fixture

* For meminfo metrics that had "kB" units, add _bytes

* Interrupts counter should have _total
2018-01-17 17:55:55 +01:00
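
The meminfo renaming in a98067a294 (fields reported with a "kB" unit gain a `_bytes` suffix) can be sketched like this. It is a simplified illustration, not the collector's code: `parseMeminfoLine` is a hypothetical helper, and it does not sanitize field names that contain characters such as parentheses. The factor of 1024 reflects that `/proc/meminfo` uses "kB" to mean 1024 bytes.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMeminfoLine turns one /proc/meminfo line into a name and a value.
// Lines with a "kB" unit are converted to bytes and get a "_bytes" suffix.
func parseMeminfoLine(line string) (string, float64, error) {
	fields := strings.Fields(line)
	if len(fields) < 2 {
		return "", 0, fmt.Errorf("malformed meminfo line: %q", line)
	}
	name := strings.TrimSuffix(fields[0], ":")
	value, err := strconv.ParseFloat(fields[1], 64)
	if err != nil {
		return "", 0, fmt.Errorf("invalid value in meminfo line %q: %v", line, err)
	}
	if len(fields) == 3 && fields[2] == "kB" {
		return name + "_bytes", value * 1024, nil
	}
	// Unitless fields (for example page counts) are passed through unchanged.
	return name, value, nil
}

func main() {
	for _, line := range []string{
		"MemTotal:       16384 kB",
		"HugePages_Total:       0",
	} {
		name, value, err := parseMeminfoLine(line)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s %.0f\n", name, value)
	}
}
```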
Ben Kochie b4d7ba119a
Add fixture for ppc64le (#785)
* Add support for per-architecture fixtures.
* Add output for ppc64le.
2018-01-11 13:56:19 +01:00
Julius Volz f536857ac6
Fix e2e tests after textfile custom timestamp removal (#768) 2017-12-24 11:54:33 +01:00
Shubheksha Jalan 1f2458f42c Filter out textfile metrics correctly when using collect[] filters (#763)
* remove injection hook for textfile metrics, convert them to prometheus format

* add support for summaries

* add support for histograms

* add logic for handling inconsistent labels within a metric family for counter, gauge, untyped

* change logic for parsing the metrics textfile

* fix logic for adding missing labels

* Export time and error metrics for textfiles

* Add tests for new textfile collector, fix found bugs

* refactor Update() to split into smaller functions

* remove parseTextFiles(), fix import issue

* add mtime metric directly to channel, fix handling of mtime during testing

* rename variables related to labels

* refactor: add default case, remove if guard for metrics, remove extra loop and slice

* refactor: remove extra loop iterating over metric families

* test: add test case for different metric type, fix found bug

* test: add test for metrics with inconsistent labels

* test: add test for histogram

* test: add test for histogram with extra dimension

* test: add test for summary

* test: add test for summary with extra dimension

* remove unnecessary creation of protobuf

* nit: remove extra blank line
2017-12-23 20:21:58 +01:00
Ben Kochie cd2a17176a
Add full make to CircleCI (#761)
* Add full make to CircleCI

Ensure end-to-end test is run.

* Fix go fmt error.

* Fix end-to-end output.
2017-12-21 16:24:23 +01:00
Ben Kochie 2a80537547
Split out guest cpu metrics on Linux. (#744)
Linux "guest" metrics for VMs are already accounted for in node_cpu
`user` and `nice` metrics.  Separate these into their own metric to
avoid duplication of data.
2017-11-23 15:04:47 +01:00
Karsten Weiss a8d7d1101a cpu: Support processor-less (memory-only) NUMA nodes (#734)
* cpu: Support processor-less (memory-only) NUMA nodes

Processor-less (memory-only) NUMA nodes exist e.g. in systems that use
Intel Optane drives for RAM expansion using Intel Memory Drive
Technology (IMDT).

IMDT RAM expansion supports two modes:

* "Unify Remote Memory domains": present a processor-less (memory-only)
  NUMA domain, which is the default
* "Expand local memory domains": to expand each processor’s memory domain
  with a portion of the memory made available by Optane and IMDT

This commit fixes a crash in the first case (when "cpulist" is empty).

Here's an example of such a system:

$ numastat -m|head -n5

Per-node system memory usage (in MBs):
                          Node 0          Node 1          Node 2           Total
                 --------------- --------------- --------------- ---------------
MemTotal               118239.56       130816.00       464384.00       713439.56

$ for i in {0..2}; do echo -n "$i: " ; cat /sys/bus/node/devices/node$i/cpulist ; done
0: 0-7,16-23
1: 8-15,24-31
2:

$ /opt/vsmp/bin/vsmpversion -vvv
Memory Drive Technology: 8.2.1455.74 (Sep 28 2017 13:09:59)
System configuration:
    Boards:      3
       1 x Proc. + I/O + Memory
       2 x NVM devices (Intel SSDPED1K375GAQ)
    Processors:  2, Cores: 16, Threads: 32
        Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz Stepping 01
    Memory (MB): 713472 (of 977450), Cache: 251416, Private: 12562
       1 x 249088MB   [262036/   678/12270]
       1 x 232192MB   [357707/125369/  146]  82:00.0#1
       1 x 232192MB   [357707/125369/  146]  83:00.0#1

* cpu: rename some variables (pkg => node)

* cpu: Use %v not %q in log.Debugf() format strings
2017-11-10 15:31:26 +01:00
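
The empty `cpulist` case from a8d7d1101a (node 2 in the example output) can be handled by a parser that treats an empty string as a valid node with no CPUs. The sketch below is an assumption about one reasonable way to do that, not the code from the commit; `parseCPUList` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUList expands a sysfs cpulist string such as "0-7,16-23" into the
// individual CPU numbers. An empty (memory-only) node yields no CPUs and
// no error instead of failing on the missing range.
func parseCPUList(s string) ([]int, error) {
	s = strings.TrimSpace(s)
	if s == "" {
		return nil, nil
	}
	var cpus []int
	for _, part := range strings.Split(s, ",") {
		bounds := strings.SplitN(part, "-", 2)
		first, err := strconv.Atoi(bounds[0])
		if err != nil {
			return nil, fmt.Errorf("invalid cpulist entry %q: %v", part, err)
		}
		last := first
		if len(bounds) == 2 {
			last, err = strconv.Atoi(bounds[1])
			if err != nil {
				return nil, fmt.Errorf("invalid cpulist entry %q: %v", part, err)
			}
		}
		if last < first {
			return nil, fmt.Errorf("invalid cpulist range %q", part)
		}
		for i := first; i <= last; i++ {
			cpus = append(cpus, i)
		}
	}
	return cpus, nil
}

func main() {
	// Values taken from the three nodes shown in the commit message.
	for node, list := range []string{"0-7,16-23", "8-15,24-31", ""} {
		cpus, err := parseCPUList(list)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("node%d: %d CPUs\n", node, len(cpus))
	}
}
```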
Ben Kochie ea250d73f4
Fix off by one in Linux interrupts collector (#721)
* Fix off by one in Linux interrupts collector

* Fix off by one in CPU column handler.
* Add test.

* Enable interrupts in end-to-end test.
2017-11-02 09:59:46 +01:00
Matt Layher f9ad88fc03
xfs: expose correct fields, fix metric names 2017-10-20 18:41:51 -04:00
Ben Kochie deadfef4c9 Update vendoring (#685)
* Update vendor github.com/coreos/go-systemd/dbus@v15

* Update vendor github.com/ema/qdisc

* Update vendor github.com/godbus/dbus

* Update vendor github.com/golang/protobuf/proto

* Update vendor github.com/lufia/iostat

* Update vendor github.com/matttproud/golang_protobuf_extensions/pbutil@v1.0.0

* Update vendor github.com/prometheus/client_golang/...

* Update vendor github.com/prometheus/common/...

* Update vendor github.com/prometheus/procfs/...

* Update vendor github.com/sirupsen/logrus@v1.0.3

Adds vendor golang.org/x/crypto

* Update vendor golang.org/x/net/...

* Update vendor golang.org/x/sys/...

* Update end to end output.
2017-10-05 16:20:47 +02:00
Karsten Weiss b0d5c00832 cpu: Metric 'package_throttles_total' is per package. (#657)
* cpu: Metric 'package_throttles_total' is per package.

'package_throttles_total' is per package, not per cpu. This also reduces
the total number of cpu time series a lot (especially for multi-core CPUs).

* cpu: Better handling of a cpulist edge-case.

* cpu: Extract the package number from the directory name.

Do not rely on the range index.

* cpu: Add package_throttle_count for node0 cpu1

This file must be ignored by the cpu collector.
2017-09-07 23:24:18 +02:00
Ben Kochie 46c31d8a7e Enable IPVS collector by default (#623)
* Silence error output when no IPVS present.
* Enable by default.
* Update end-to-end fixture.
* Update README.
2017-07-26 15:20:28 +02:00
Andrea De Pasquale 1369763067 Change raid0 status line regexp for mdadm collector (#619) 2017-07-20 17:04:33 +02:00
Aleksey Zhukov 7a914e58f2 Add parsing /proc/net/snmp6 file for netstat-linux (#615)
* Add parsing /proc/net/snmp6 file

* add /proc/net/snmp6 fixture

* fix e2e test

* gofmt

* remove unused variable

* safe checks

* add tests

* change help format
2017-07-08 20:16:35 +02:00
Matt Layher 6e82fd1c56 Add XFS block mapping and block map B-tree stats (#575) 2017-07-07 07:27:52 +02:00
ideaship 8d90276283 Add bcache collector (#597)
* Add bcache collector for Linux

This collector gathers metrics related to the Linux block cache
(bcache) from sysfs.

* Removed commented out code

* Use project comment style

* Add _sectors to metric name to indicate unit

* Really use project comment style

* Rename bcache.go to bcache_linux.go

* Keep collector namespace clean

Rename:
- metric -> bcacheMetric
- periodStatsToMetrics -> bcachePeriodStatsToMetric

* Shorten slice initialization

* Change label names to backing_device, cache_device

* Remove five minute metrics (keep only total)

* Include units in additional metric names

* Enable bcache collector by default

* Provide metrics in seconds, not nanoseconds

* remove metrics with label "all"

* Add fixtures, update end-to-end for bcache collector

* Move fixtures/sys into tar.gz

This changeset moves the collector/fixtures/sys directory into
collector/fixtures/sys.tar.gz and tweaks the Makefile to unpack the
tarball before tests are run.

The reason for this change is that Windows does not allow colons in a
path (colons are present in some of the bcache fixture files), nor can
it (out of the box) deal with pathnames longer than 260 characters
(which we would be increasingly likely to hit if we tried to replace
colons with longer codes that are guaranteed not to turn up in regular
file names).

* Add ttar: plain text archive, replacement for tar

This changeset adds ttar, a plain text replacement for tar, and uses it
for the sysfs fixture archive. The syntax is loosely based on tar(1).

Using a plain text archive makes it possible to review changes without
downloading and extracting the archive. Also, when working on the repo,
git diff and git log become useful again, allowing a committer to verify
and track changes over time.

The code is written in bash, because bash is available out of the box on
all major flavors of Linux and on macOS. The feature set used is
restricted to bash version 3.2 because that is what Apple is still
shipping.

The program also works on Windows if bash is installed. Obviously, it
does not solve the Windows limitations (path length limited to 260
characters, no symbolic links) that prompted the move to an archive
format in the first place.
2017-07-07 07:20:18 +02:00
Rene Treffer bcc3cd92b8 Fix cpufreq statistics by converting kHz to Hz 2017-06-27 11:05:55 +02:00
Ben Kochie 182810056f Fix Linux cpu errors (#606)
Make the Linux cpu collector soft-error on missing `cpufreq` and
`thermal_throttle` features.
2017-06-20 07:51:26 +02:00
Rene Treffer 2e9f1913b8 Move stat_linux to cpu_linux and add cpufreq stats (#548) 2017-06-13 11:21:53 +02:00
Emanuele Rocca 047003b6bb Add qdisc collector for Linux (#580)
* Add qdisc collector for Linux

This collector gathers basic queueing discipline metrics via netlink,
similarly to what `tc -s qdisc show` does.

* qdisc collector: nl-specific code moved, names fixed

- netlink-specific parts moved to github.com/ema/qdisc
- avoid using shortened names
- counters renamed into XXX_total

* Get rid of parseMessage error checking leftover

* Add github.com/ema/qdisc to vendored packages

* Update help texts and comments

* Add qdisc collector to README file

* qdisc collector end-to-end testing

* Update qdisc dependency to latest version

Update github.com/ema/qdisc dependency to revision 2c7e72d, which
includes unit testing.

* qdisc collector: rename "iface" label into "device"
2017-05-23 11:55:50 +02:00
Robert Clark 58f50b31f2 Multiply port data XMIT/RCV metrics by 4 (#579)
According to Mellanox, it is standard practice that the port_xmit_data and port_rcv_data
files are split into 4 lanes. To get the actual transmit and receive values for each
port, the metric needs to be multiplied by 4.

Signed-Off-By: Robert Clark <robert.d.clark@hpe.com>
2017-05-12 07:28:53 +02:00
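
The scaling described in 58f50b31f2 amounts to multiplying the raw `port_xmit_data` and `port_rcv_data` values by 4 before exposing them. A minimal sketch, assuming the raw value arrives as the text content of the counter file; `scalePortData` and `portDataMultiplier` are hypothetical names, and overflow of very large counters is not handled here.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// portDataMultiplier is the factor from the commit message.
const portDataMultiplier = 4

// scalePortData parses the raw counter text and applies the multiplier.
func scalePortData(raw string) (uint64, error) {
	v, err := strconv.ParseUint(strings.TrimSpace(raw), 10, 64)
	if err != nil {
		return 0, fmt.Errorf("invalid port data value %q: %v", raw, err)
	}
	return v * portDataMultiplier, nil
}

func main() {
	// Made-up sample content of a port_xmit_data file.
	scaled, err := scalePortData("12345\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(scaled)
}
```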
Matt Layher 1feb091b36
Initial XFS collector 2017-04-22 11:53:07 -04:00
Karsten Weiss d9703ff7c6 edac: Fix typo in csrow label of node_edac_csrow_uncorrectable_errors_total metric. 2017-04-18 12:45:06 +02:00
Karsten Weiss 45ca8db352 Support the 'guest_nice' cpu mode of /proc/stat.
'guest_nice' is available since Linux 2.6.33.
2017-04-14 12:50:37 +02:00
Sam Kottler 6eafa51fa8 Add ARP collector for Linux (#540)
* Implement commonalities and linux support for ARP collection

* Add ARP collector to fixtures and run as part of e2e tests

* Bubble up scanner errors

* Use single return values where it makes sense

* Add missing annotation

* Move arp_common into arp_linux

* Add license header to arp_linux.go

* Address initial feedback

* Use strings.Fields instead of strings.Split

* Deal with scanner.Err() rather than throwing away errors

* Check for scan errors in-line before interacting with the entries map

* Don't interact with potentially empty text from scan

* Check for scan errors outside the scan loop

* Add comment about moving procfs parsing

* Add more direct comment

* Update initialism style to match go style guide

* Put function args on the same line

* Add TODO in front of comment about procfs extraction

* Guard against strings.Fields returning an empty slice

* Be more defensive about ARP table format and use upcase more broadly

* Enable the ARP collector by default

* Add ARP collector to the README

* Remove 'entry'
2017-04-11 17:45:19 +02:00
Johannes 'fish' Ziemke 9676f5f2dc Merge pull request #523 from roclark/support-legacy-infiniband
Add support for legacy InfiniBand drivers
2017-03-21 10:52:07 +01:00
Matt Layher 2bfe410fb7
Expand wifi collector for more interface types 2017-03-20 12:25:01 -04:00
Robert Clark 3a5917dfdc Add support for legacy InfiniBand drivers
Older versions of the OFED drivers contain 64-bit variants of the port counters and are located in a directory named 'counters_ext'. This patch includes these older metrics that have since been deprecated with OFED 4.0.

Signed-Off-By: Robert Clark <robert.d.clark@hpe.com>
2017-03-20 10:37:21 -05:00
Tobias Schmidt 0400e437be Fix and simplify parsing of raid metrics
Fixes the wrong reporting of active+total disk metrics for inactive
raids. Also simplifies the code and removes a couple of redundant
comments.
2017-03-19 08:03:58 -03:00
Matt Layher 69368b7f9c Add synthetic node_wifi_station_info metric for BSS information 2017-03-16 16:24:23 -04:00
Brian Brazil a02e469b07 Report collector success/failure and duration per scrape. (#516)
This is in line with best practices, and also saves us
63 timeseries on a default Linux setup.
2017-03-16 17:21:00 +00:00
Tobias Schmidt ce117d7a40 Update vendored packages 2017-02-28 18:20:24 -04:00