* Implement commonalities and linux support for ARP collection
* Add ARP collector to fixtures and run as part of e2e tests
* Bubble up scanner errors
* Use single return values where it makes sense
* Add missing annotation
* Move arp_common into arp_linux
* Add license header to arp_linux.go
* Address initial feedback
* Use strings.Fields instead of strings.Split
* Deal with scanner.Err() rather than throwing away errors
* Check for scan errors in-line before interacting with the entries map
* Don't interact with potentially empty text from scan
* Check for scan errors outside the scan loop
* Add comment about moving procfs parsing
* Add more direct comment
* Update initialism style to match go style guide
* Put function args on the same line
* Add TODO in front of comment about procfs extraction
* Guard against strings.Fields returning an empty slice
* Be more defensive about ARP table format and use upcase more broadly
* Enable the ARP collector by default
* Add ARP collector to the README
* Remove 'entry'
Instead of maintaining a counter metric for device errors in memory,
this change exports a gauge and uses const metrics to avoid leaking
metrics for unmounted filesystems.
Older versions of the OFED drivers contain 64-bit variants of the port counters and are located in a directory named 'counters_ext'. This patch includes these older metrics that have since been deprecated with OFED 4.0.
Signed-Off-By: Robert Clark <robert.d.clark@hpe.com>
In case a metric file within the InfiniBand collector doesn't exist, skip the metric in order to allow collection of the remaining valid InfiniBand metrics.
Signed-Off-By: Robert Clark <robert.d.clark@hpe.com>
Named return variables should only be used to describe the returned type
further, e.g. `err error` doesn't add any new information and is just
stutter.
Add new metrics for the InfiniBand network protocol including the amount of packets sent and received, the number of times the link has been downed and how many times the link has recovered from an error state.
Signed-Off-By: Robert Clark <robert.d.clark@hpe.com>
Removed all global types that were unnecessary, and refactored to use constructor-created values and inline values instead of globals.
Signed-Off-By: Joe Handzik <joseph.t.handzik@hpe.com>
This also involves removing zfs_zpool code for now.
Signed-Off-By: Corey Stewart <stewa169@purdue.edu>
Signed-Off-By: Joe Handzik <joseph.t.handzik@hpe.com>
This patch makes stylistic changes to error strings, unexports method names by lower casing them, removes unused dataSetMetric, and adds copyright/licence information.
Signed-Off-By: Corey Stewart <stewa169@purdue.edu>
It is tested on FreeBSD 10.2-RELEASE and Linux (ZFS on Linux 0.6.5.4).
On FreeBSD, Solaris, etc. ZFS metrics are exposed through sysctls.
ZFS on Linux exposes the same metrics through procfs `/proc/spl/...`.
In addition to sysctl metrics, 'computed metrics' are exposed by
the collector, which are based on several sysctl values.
There is some conditional logic involved in computing these metrics
which cannot be easily mapped to PromQL.
Not all 92 ARC sysctls are exposed right now but this can be changed
with one additional LOC each.
The devstat API expects us to reuse one devinfo for many invocations of
devstat_getstats. In particular, it allocates and resizes memory
referenced by devinfo.
Querying the number of devices separately from the device list itself is
racy. Devices may be added or removed between the two calls; and removed
devices would lead to a segfault.
The memory allocated by calloc was never freed. Since the devinfo struct
never leaves the function, anyway, we might as well just allocate it on
the stack.
It seems solaris prefers "sys/loadavg.h" over "stdlib.h" when
fetching the load average.
For Illumos based OSes it was required to include "sys/time.h" to
ensure that "hrtime_t" was defined.
https://www.illumos.org/issues/6002
It also required setting the ldflags "-fno-stack-protector -lssp" to
avoid undefined symbols when linking with gcc.
/opt/local/go/pkg/tool/solaris_amd64/link: running gcc failed: exit status 1
Undefined first referenced
symbol in file
__stack_chk_fail /tmp/go-link-138622936/000002.o
__stack_chk_guard /tmp/go-link-138622936/000002.o
Instead of doing the whole metric exposition in a platform specific collector
implementation, this creates and updates the metrics in meminfo.go and
expected a platform specific implementation of getMemInfo on
*meminfoCollector.
This removes some error handling, which should be fine. If the calls
fail, we will get the zeroes, which is a safe enough fallback.
Additionally, if the first sysctl (page_size) succeeded it is unlikely
that other ones will fail.
node_exporter currently triggers autofs to mount the underlying
filesystem on every scrape. This is undesirable. Better ignore autofs.
The underlying filesystem that autofs mounts will be monitored though,
when the (real) filesystem is mounted.
They get printed all the time, as there are some tokens in the /proc
file that we simply don't support. It's better to keep these as
debugging messages, which may come in useful if new tags start to
appear.
- Use the right number of printf() arguments. Use %q where it makes sense.
- Use "DRBD" instead of "Drbd", per Go's style guide.
- Add _total suffixes to counter metrics.
- Mention the unit (bytes) in documentation strings once more.
This collector exposes most of the useful information that can be found
in /proc/drbd. Sizes are normalised to be in bytes, as /proc/drbd uses
kibibytes.
This change adds a new collector called "nfs" that parses the contents
of /proc/net/rpc/nfs and turns it into metrics. It can be used to
inspect the number of operations per type, but also to keep an eye on an
extraneous number of retransmissions, which may indicate connectivity
issues.
I've picked the name "nfs", as most operating systems use "nfs" for the
client component and "nfsd" as the server component. If we want to add
stats for the NFS server as well, we'd better call such a collector
"nfsd".
The chip label generation has been changed in #334 to prefer the
unique device path (e.g. the location on the PCI bus) due to #333.
Here, a new annotation metric ``node_hwmon_chip_names`` is
introduced which allows to link the unique chip sysfs path to a
human-readable chip name which may not be unique among chip sysfs
paths (for example, dual-slot systems have multiple
chipType="coretemp" sensors).
This allows to mitigate the downsides of the solution to #333
(namely that the device path may not be stable across kernels and
reboots) for cases where it does not matter that multiple devices
may have the same human-readable name (e.g. aggregation or where
at most one device with a common chip name is present).
For cases where no human-readable name can be derived, the
annotation metric is not emitted.
We seem to have a small number of Linux servers here that have lines in
/proc/mdstat that cannot be parsed by the node exporter, due to them
containing attributes that are not matched by the regular expression
("super 1.2").
Extend the regular expression to skip this data, just like we do for all
of the other status lines.
* Prefer device path based names over exported names
For some sensors (like coretemp) it is possible that multiple
instances exist, thus base the name on the device path and not on
the exported name.
* Update end-to-end test for dual socket machines
Explicitly have 2 coretemp instances with a symlink for the device
such that the hwmon collector must pick that name (or fail)
* Add Linux NUMA "numastat" metrics
Read the `numastat` metrics from /sys/devices/system/node/node* when reading NUMA meminfo metrics.
* Update end-to-end test output.
* Add `numastat` metrics as counters.
* Add tests for error conditions.
* Refactor meminfo numa metrics struct
* Refactor meminfoKey into a simple struct of metric data.
This makes it easier to pass slices of metrics around.
* Refactor tests.
* Fixup: Add suggested fixes.
* Fixup: More fixes
* Add another scanner.Err() return
* Add "_total" to counter metrics.
* Add hwmon support (mainly known from lm-sensors)
This commit adds initial support for linux hardware sensors, exported
through sysfs.
Details of the interface can be found at
https://www.kernel.org/doc/Documentation/hwmon/sysfs-interface
* Add end-to-end test with some real life data
* Cleanup comments on hwmon collector
* Drop raw sensor name from hwmon output
* Let the sensor label be "sensor"
* Add hwmon short description to README.
The correct frequency is the systimer frequency,
not the stathz.
From one of the DragonFly developers:
The bump upon each statclock is:
((cur_systimer - prev_systimer) * systimer_freq) >> 32
systimer_freq can be extracted from following
sysctl in userspace:
sysctl kern.cputimer.freq
The convention of the linux driver is nvme($device)n($namespace)p($partition). On *bsd it seems to be different, using "ns" instead of "n" as the namespace separator.
Previously the raw time difference was used which includes the network trip time
between the node and the ntp server. This makes setting alerts off the value
troublesome as it depends on the latency as well as the clock offset.
logind provides a nice interface to find out about the numbers of sessions
on a system; it is used on most Linux distributions, even those which
aren't using systemd.
The exporter exposes the total number of sessions indexed by the following
attributes:
* seat
* type ("tty", "x11", ...)
* class ("user", "greeter", ...)
* remote ("true"/"false")
This removes the requirement to run `node_exporter` as root or with read
access to `/dev/kmem` in order to get CPU usage statistics.
Once FreeBSD adds a macro for the `kern.cp_times` sysctl, the
`setupSysctlMIBs()` function should be replaced by usage of the macro.
When compiling `20ecedd0b4c983bd7b88f97cd7a21461988a6c12` with GNU make (`gmake`) on FreeBSD 10.2-RELEASE, I get the following error:
```
collector/filesystem_bsd.go:60: non-bool mnt[i].f_flags & MNT_RDONLY (type C.uint64_t) used as if condition
Makefile.COMMON:85: recipe for target 'node_exporter' failed
gmake: *** [node_exporter] Error 2
```
This problem is fixed by this patch.
It turns out, on some kernels (notably - CentOS6) there is an empty line
inserted at the beginning of /sys/devices/system/node/node*/meminfo
files. The leads to node_exporter crash on such kernels.
Fix this by checking for empty string first.
Signed-off-by: Pavel Borzenkov <pavel.borzenkov@gmail.com>
Add new collector which exposes the content of /sys/kernel/mm/ksm
directory. This directory contains control and statistics files for
Kernel Samepage Merging daemon.
The collector is not enabled by default.
Signed-off-by: Pavel Borzenkov <pavel.borzenkov@gmail.com>
Entry collector uses readUintFromFile() function which is defined by
conntrack collector. Thus, it is impossible to build node_exporter w/o
conntrack collector. Fix this by factoring out the function into
helper.go file.
Signed-off-by: Pavel Borzenkov <pavel.borzenkov@gmail.com>
It is sometimes useful to understand the distribution of free/occupied
memory between NUMA nodes to deal with performance problems. To do so,
add new meminfo_numa collector that enables exporting of per node
statistics along with unit and end-to-end tests for it.
Signed-off-by: Pavel Borzenkov <pavel.borzenkov@gmail.com>
Removes unused signal handlers left over from signal based collection
and block the non windows-relevant collectors loadavg and interrupts.
Signal based collection removed in 1c17481a42.
As OS X doesn't have it's own interrupts provider, don't build
interrupts_common on OS X as well. Otherwise build fails, because
interrupts_common depends on variables provided by platform-specific
files.
Signed-off-by: Pavel Borzenkov <pavel.borzenkov@gmail.com>
Current behaviour throws away all stats on any Statfs error. In practice
this is not useful. This turns such errors into debug log messages -
though silently ignoring them seems even more valid to me.
This test runs a selection of collectors against the fixtures and
compares the output to a reference.
The uname and filesystem collectors are disabled because they use system
calls that cannot be fixtured easily.
Remove all hardcoded references to `/proc`. For all collectors that do
not use `github.com/prometheus/procfs` yet, provide a wrapper to
generate the full paths.
Reformulate help strings, errors and comments to remove absolute
references to `/proc`.
This is a breaking change: the `-collector.ipvs.procfs` flag is removed
in favor of the general flag. Since it only affected that collector it
was only useful for development, so this should not cause many issues.
This creates a single metric like:
node_uname_info{domainname="(none)",machine="x86_64",nodename="desktop",release="3.16.0-48-generic",sysname="Linux",version="#64~14.04.1-Ubuntu SMP Thu Aug 20 23:03:57 UTC 2015"} 1
Fixed file-nr update function
Fixed file-nr test case
Fixed file-nr test case again
Fixed file-nr separator to tab
Updated file-nr to filenr.
Updated file-nr to filenr.
Fixed file-nr test cases, added comments
Remove reporting the second value from file-nr as it will alwasy be zero in linux 2.6 and greator
Renaming file-nr to filefd
Updated build constraint
Updates and code cleanup for filefd.
Updated enabledCollectors with the correct name for filefd
Fixed filefd test wording
initial work on sockstat work
Fixed package name
Finished implementation of the sockstat plugin
missed a return value
Added sockstat to default plugins to start
Fixed scanner read on sockstat
fixed sockstat linux test for TCP alloc
update sockstat test case
Updated sockstat to return TCP and UDP memory in bytes instead of page count
Collects more information for labelling scraped filesystems with the device
and fsType. This is useful for setting alerts which should change based on
filesystem type, or for filtering out shared mounts such as with NFS volumes.
This puts all collector-specific flags into their own namespace under
"collector.<collector-name>", and moves from camel case to dashes, which
is the standard in Prometheus land now.
Fixes issue described in #38
/proc/stat reports a blank line which needs to be ignored.
Old kernels misses one CPU time field, this needs to be ignored too.
This allows static metrics (e.g. an attributes collector replacement),
and cronjobs to expose stats by echoing into a file.
For example:
echo "my_metric 123" > mycronjob.prom.$$
mv mycronjob.prom.$$ mycronjob.prom
Remove special tags necessary for gmond and runit collectors. All
collectors get built. Selection of which collectors to use continues to
happen via parameter.
This catches things like listen overflows, retransmits
and other things that are very useful for retroactive debugging
thus I think it's justified to have it on by default.
Switch to Update using the Collecter Collect interface, due to not knowing all
metricnames in all modules beforehand we can't use Describe and thus the full
Collecter interface.
Remove 'updates', it's meaning varies by module and doesn't add much.
This collector exposes two metrics:
- net_bonding_slaves: configured slaves per bonding interface
- net_bonding_slaves_active: currently active slaves per bonding
interface
This collector exports the following metrics:
- raid_drive_temperature: drive temperature
- raid_drive_count: drive error and event counters
- raid_adapter_disk_presence: disk presence per adapter
Last login is disabled by default as it's broken on ubuntu 12.04
Interrupts is disabled by default as it's very granular and we'll have total interrupts from /proc/stat
Allow ignoring devices from diskstats, ignore ram and loop devices by default.
Use glog for logging.
Diskstats: Split out metrics, keep 'device' label
Meminfo: Split out metrics, one each. Convert kB to bytes.
Netstats: Split out metrics, keep 'device' label.
Interrupts: Stays the same. Not perfect, but should be rarely used.
Loadavg: Make it clear it's the 1m loadavg
Last seen: Not clear this belongs in the node exporter, as it's more a user
thing than a machine thing. Changed to absolute time rather than relative.
All stats now have appropriate counter/gauge type.
The gmond (Ganglia) exporter module exports many metrics not under our
control. They should all be prefixed in a common way to make it obvious
where they came from.
Fixes https://github.com/prometheus/node_exporter/issues/8