Add bcache collector (#597)

* Add bcache collector for Linux

This collector gathers metrics related to the Linux block cache
(bcache) from sysfs.

* Removed commented out code

* Use project comment style

* Add _sectors to metric name to indicate unit

* Really use project comment style

* Rename bcache.go to bcache_linux.go

* Keep collector namespace clean

Rename:
- metric -> bcacheMetric
- periodStatsToMetrics -> bcachePeriodStatsToMetric

* Shorten slice initialization

* Change label names to backing_device, cache_device

* Remove five minute metrics (keep only total)

* Include units in additional metric names

* Enable bcache collector by default

* Provide metrics in seconds, not nanoseconds

* Remove metrics with label "all"

* Add fixtures, update end-to-end for bcache collector

* Move fixtures/sys into tar.gz

This changeset moves the collector/fixtures/sys directory into
collector/fixtures/sys.tar.gz and tweaks the Makefile to unpack the
tarball before tests are run.

The reason for this change is that Windows does not allow colons in a
path (colons are present in some of the bcache fixture files), nor can
it (out of the box) deal with pathnames longer than 260 characters
(which we would be increasingly likely to hit if we tried to replace
colons with longer codes that are guaranteed not to turn up in regular
file names).

* Add ttar: plain text archive, replacement for tar

This changeset adds ttar, a plain text replacement for tar, and uses it
for the sysfs fixture archive. The syntax is loosely based on tar(1).

Using a plain text archive makes it possible to review changes without
downloading and extracting the archive. Also, when working on the repo,
git diff and git log become useful again, allowing a committer to verify
and track changes over time.

The code is written in bash, because bash is available out of the box on
all major flavors of Linux and on macOS. The feature set used is
restricted to bash version 3.2 because that is what Apple is still
shipping.

The program also works on Windows if bash is installed. Obviously, it
does not solve the Windows limitations (path length limited to 260
characters, no symbolic links) that prompted the move to an archive
format in the first place.
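
As a usage sketch, this is how the archive is handled in practice, based on
the Makefile rule added in this commit (running `make test` assumes you are
in the repository root):

# Unpack the plain text sysfs fixture archive by hand
# (the same command the new Makefile rule runs):
./ttar -C collector/fixtures -x -f collector/fixtures/sys.ttar

# Or just run the tests; the new .unpacked prerequisite unpacks it on demand:
make test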

Author: ideaship
Date: 2017-07-07 07:20:18 +02:00
Committer: Ben Kochie
Parent: bba075710d
Commit: 8d90276283

216 changed files with 2452 additions and 299 deletions

Makefile

@@ -41,11 +41,15 @@ style:
 	@echo ">> checking code style"
 	@! gofmt -d $(shell find . -path ./vendor -prune -o -name '*.go' -print) | grep '^'
 
-test:
+test: collector/fixtures/sys/.unpacked
 	@echo ">> running tests"
 	@$(GO) test -short $(pkgs)
 
-test-e2e: build
+collector/fixtures/sys/.unpacked: collector/fixtures/sys.ttar
+	./ttar -C collector/fixtures -x -f collector/fixtures/sys.ttar
+	touch $@
+
+test-e2e: build collector/fixtures/sys/.unpacked
 	@echo ">> running end-to-end tests"
 	./end-to-end-test.sh

README.md

@@ -22,6 +22,7 @@ Which collectors are used is controlled by the `--collectors.enabled` flag.
 Name     | Description | OS
 ---------|-------------|----
 arp | Exposes ARP statistics from `/proc/net/arp`. | Linux
+bcache | Exposes bcache statistics from `/sys/fs/bcache/`. | Linux
 conntrack | Shows conntrack statistics (does nothing if no `/proc/sys/net/netfilter/` present). | Linux
 cpu | Exposes CPU statistics | Darwin, Dragonfly, FreeBSD, Linux
 diskstats | Exposes disk I/O statistics. | Darwin, Linux
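
A quick way to check the new collector's output (a minimal sketch, assuming
node_exporter is running locally on its default port 9100 and a bcache
device is present; the metric names come from the end-to-end fixture below):

# List all bcache metrics exposed by the running exporter:
curl -s http://localhost:9100/metrics | grep '^node_bcache_'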

collector/bcache_linux.go (new file, 304 lines)

@@ -0,0 +1,304 @@
// Copyright 2017 The Prometheus Authors
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

// +build !nobcache

package collector

import (
	"fmt"

	// https://godoc.org/github.com/prometheus/client_golang/prometheus
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/procfs/bcache"
	"github.com/prometheus/procfs/sysfs"
)

func init() {
	Factories["bcache"] = NewBcacheCollector
}

// A bcacheCollector is a Collector which gathers metrics from Linux bcache.
type bcacheCollector struct {
	fs sysfs.FS
}

// NewBcacheCollector returns a newly allocated bcacheCollector.
// It exposes a number of Linux bcache statistics.
func NewBcacheCollector() (Collector, error) {
	fs, err := sysfs.NewFS(*sysPath)
	if err != nil {
		return nil, fmt.Errorf("failed to open sysfs: %v", err)
	}

	return &bcacheCollector{
		fs: fs,
	}, nil
}

// Update reads and exposes bcache stats.
// It implements the Collector interface.
func (c *bcacheCollector) Update(ch chan<- prometheus.Metric) error {
	stats, err := c.fs.BcacheStats()
	if err != nil {
		return fmt.Errorf("failed to retrieve bcache stats: %v", err)
	}

	for _, s := range stats {
		c.updateBcacheStats(ch, s)
	}
	return nil
}

type bcacheMetric struct {
	name            string
	desc            string
	value           float64
	metricType      prometheus.ValueType
	extraLabel      []string
	extraLabelValue string
}

func bcachePeriodStatsToMetric(ps *bcache.PeriodStats, labelValue string) []bcacheMetric {
	label := []string{"backing_device"}

	metrics := []bcacheMetric{
		{
			name: "bypassed_bytes_total",
			desc: "Amount of IO (both reads and writes) that has bypassed the cache.",
			value: float64(ps.Bypassed),
			metricType: prometheus.CounterValue,
			extraLabel: label,
			extraLabelValue: labelValue,
		},
		{
			name: "cache_hits_total",
			desc: "Hits counted per individual IO as bcache sees them.",
			value: float64(ps.CacheHits),
			metricType: prometheus.CounterValue,
			extraLabel: label,
			extraLabelValue: labelValue,
		},
		{
			name: "cache_misses_total",
			desc: "Misses counted per individual IO as bcache sees them.",
			value: float64(ps.CacheMisses),
			metricType: prometheus.CounterValue,
			extraLabel: label,
			extraLabelValue: labelValue,
		},
		{
			name: "cache_bypass_hits_total",
			desc: "Hits for IO intended to skip the cache.",
			value: float64(ps.CacheBypassHits),
			metricType: prometheus.CounterValue,
			extraLabel: label,
			extraLabelValue: labelValue,
		},
		{
			name: "cache_bypass_misses_total",
			desc: "Misses for IO intended to skip the cache.",
			value: float64(ps.CacheBypassMisses),
			metricType: prometheus.CounterValue,
			extraLabel: label,
			extraLabelValue: labelValue,
		},
		{
			name: "cache_miss_collisions_total",
			desc: "Instances where data insertion from cache miss raced with write (data already present).",
			value: float64(ps.CacheMissCollisions),
			metricType: prometheus.CounterValue,
			extraLabel: label,
			extraLabelValue: labelValue,
		},
		{
			name: "cache_readaheads_total",
			desc: "Count of times readahead occurred.",
			value: float64(ps.CacheReadaheads),
			metricType: prometheus.CounterValue,
			extraLabel: label,
			extraLabelValue: labelValue,
		},
	}

	return metrics
}

// UpdateBcacheStats collects statistics for one bcache ID.
func (c *bcacheCollector) updateBcacheStats(ch chan<- prometheus.Metric, s *bcache.Stats) {
	const (
		subsystem = "bcache"
	)

	var (
		devLabel   = []string{"uuid"}
		allMetrics []bcacheMetric
		metrics    []bcacheMetric
	)

	allMetrics = []bcacheMetric{
		// metrics in /sys/fs/bcache/<uuid>/
		{
			name: "average_key_size_sectors",
			desc: "Average data per key in the btree (sectors).",
			value: float64(s.Bcache.AverageKeySize),
			metricType: prometheus.GaugeValue,
		},
		{
			name: "btree_cache_size_bytes",
			desc: "Amount of memory currently used by the btree cache.",
			value: float64(s.Bcache.BtreeCacheSize),
			metricType: prometheus.GaugeValue,
		},
		{
			name: "cache_available_percent",
			desc: "Percentage of cache device without dirty data, useable for writeback (may contain clean cached data).",
			value: float64(s.Bcache.CacheAvailablePercent),
			metricType: prometheus.GaugeValue,
		},
		{
			name: "congested",
			desc: "Congestion.",
			value: float64(s.Bcache.Congested),
			metricType: prometheus.GaugeValue,
		},
		{
			name: "root_usage_percent",
			desc: "Percentage of the root btree node in use (tree depth increases if too high).",
			value: float64(s.Bcache.RootUsagePercent),
			metricType: prometheus.GaugeValue,
		},
		{
			name: "tree_depth",
			desc: "Depth of the btree.",
			value: float64(s.Bcache.TreeDepth),
			metricType: prometheus.GaugeValue,
		},
		// metrics in /sys/fs/bcache/<uuid>/internal/
		{
			name: "active_journal_entries",
			desc: "Number of journal entries that are newer than the index.",
			value: float64(s.Bcache.Internal.ActiveJournalEntries),
			metricType: prometheus.GaugeValue,
		},
		{
			name: "btree_nodes",
			desc: "Total nodes in the btree.",
			value: float64(s.Bcache.Internal.BtreeNodes),
			metricType: prometheus.GaugeValue,
		},
		{
			name: "btree_read_average_duration_seconds",
			desc: "Average btree read duration.",
			value: float64(s.Bcache.Internal.BtreeReadAverageDurationNanoSeconds) * 1e-9,
			metricType: prometheus.GaugeValue,
		},
		{
			name: "cache_read_races",
			desc: "Counts instances where while data was being read from the cache, the bucket was reused and invalidated - i.e. where the pointer was stale after the read completed.",
			value: float64(s.Bcache.Internal.CacheReadRaces),
			metricType: prometheus.CounterValue,
		},
	}

	for _, bdev := range s.Bdevs {
		// metrics in /sys/fs/bcache/<uuid>/<bdev>/
		metrics = []bcacheMetric{
			{
				name: "dirty_data_bytes",
				desc: "Amount of dirty data for this backing device in the cache.",
				value: float64(bdev.DirtyData),
				metricType: prometheus.GaugeValue,
				extraLabel: []string{"backing_device"},
				extraLabelValue: bdev.Name,
			},
		}
		allMetrics = append(allMetrics, metrics...)

		// metrics in /sys/fs/bcache/<uuid>/<bdev>/stats_total
		metrics := bcachePeriodStatsToMetric(&bdev.Total, bdev.Name)
		allMetrics = append(allMetrics, metrics...)
	}

	for _, cache := range s.Caches {
		metrics = []bcacheMetric{
			// metrics in /sys/fs/bcache/<uuid>/<cache>/
			{
				name: "io_errors",
				desc: "Number of errors that have occurred, decayed by io_error_halflife.",
				value: float64(cache.IOErrors),
				metricType: prometheus.GaugeValue,
				extraLabel: []string{"cache_device"},
				extraLabelValue: cache.Name,
			},
			{
				name: "metadata_written_bytes_total",
				desc: "Sum of all non data writes (btree writes and all other metadata).",
				value: float64(cache.MetadataWritten),
				metricType: prometheus.CounterValue,
				extraLabel: []string{"cache_device"},
				extraLabelValue: cache.Name,
			},
			{
				name: "written_bytes_total",
				desc: "Sum of all data that has been written to the cache.",
				value: float64(cache.Written),
				metricType: prometheus.CounterValue,
				extraLabel: []string{"cache_device"},
				extraLabelValue: cache.Name,
			},
			// metrics in /sys/fs/bcache/<uuid>/<cache>/priority_stats
			{
				name: "priority_stats_unused_percent",
				desc: "The percentage of the cache that doesn't contain any data.",
				value: float64(cache.Priority.UnusedPercent),
				metricType: prometheus.GaugeValue,
				extraLabel: []string{"cache_device"},
				extraLabelValue: cache.Name,
			},
			{
				name: "priority_stats_metadata_percent",
				desc: "Bcache's metadata overhead.",
				value: float64(cache.Priority.MetadataPercent),
				metricType: prometheus.GaugeValue,
				extraLabel: []string{"cache_device"},
				extraLabelValue: cache.Name,
			},
		}
		allMetrics = append(allMetrics, metrics...)
	}

	for _, m := range allMetrics {
		labels := append(devLabel, m.extraLabel...)

		desc := prometheus.NewDesc(
			prometheus.BuildFQName(Namespace, subsystem, m.name),
			m.desc,
			labels,
			nil,
		)

		labelValues := []string{s.Name}
		if m.extraLabelValue != "" {
			labelValues = append(labelValues, m.extraLabelValue)
		}

		ch <- prometheus.MustNewConstMetric(
			desc,
			m.metricType,
			m.value,
			labelValues...,
		)
	}
}

End-to-end test fixture (expected exporter output)

@@ -75,6 +75,75 @@ http_response_size_bytes_count{handler="prometheus"} 0
# TYPE node_arp_entries gauge
node_arp_entries{device="eth0"} 3
node_arp_entries{device="eth1"} 3
# HELP node_bcache_active_journal_entries Number of journal entries that are newer than the index.
# TYPE node_bcache_active_journal_entries gauge
node_bcache_active_journal_entries{uuid="deaddd54-c735-46d5-868e-f331c5fd7c74"} 1
# HELP node_bcache_average_key_size_sectors Average data per key in the btree (sectors).
# TYPE node_bcache_average_key_size_sectors gauge
node_bcache_average_key_size_sectors{uuid="deaddd54-c735-46d5-868e-f331c5fd7c74"} 0
# HELP node_bcache_btree_cache_size_bytes Amount of memory currently used by the btree cache.
# TYPE node_bcache_btree_cache_size_bytes gauge
node_bcache_btree_cache_size_bytes{uuid="deaddd54-c735-46d5-868e-f331c5fd7c74"} 0
# HELP node_bcache_btree_nodes Total nodes in the btree.
# TYPE node_bcache_btree_nodes gauge
node_bcache_btree_nodes{uuid="deaddd54-c735-46d5-868e-f331c5fd7c74"} 0
# HELP node_bcache_btree_read_average_duration_seconds Average btree read duration.
# TYPE node_bcache_btree_read_average_duration_seconds gauge
node_bcache_btree_read_average_duration_seconds{uuid="deaddd54-c735-46d5-868e-f331c5fd7c74"} 1.305e-06
# HELP node_bcache_bypassed_bytes_total Amount of IO (both reads and writes) that has bypassed the cache.
# TYPE node_bcache_bypassed_bytes_total counter
node_bcache_bypassed_bytes_total{backing_device="bdev0",uuid="deaddd54-c735-46d5-868e-f331c5fd7c74"} 0
# HELP node_bcache_cache_available_percent Percentage of cache device without dirty data, useable for writeback (may contain clean cached data).
# TYPE node_bcache_cache_available_percent gauge
node_bcache_cache_available_percent{uuid="deaddd54-c735-46d5-868e-f331c5fd7c74"} 100
# HELP node_bcache_cache_bypass_hits_total Hits for IO intended to skip the cache.
# TYPE node_bcache_cache_bypass_hits_total counter
node_bcache_cache_bypass_hits_total{backing_device="bdev0",uuid="deaddd54-c735-46d5-868e-f331c5fd7c74"} 0
# HELP node_bcache_cache_bypass_misses_total Misses for IO intended to skip the cache.
# TYPE node_bcache_cache_bypass_misses_total counter
node_bcache_cache_bypass_misses_total{backing_device="bdev0",uuid="deaddd54-c735-46d5-868e-f331c5fd7c74"} 0
# HELP node_bcache_cache_hits_total Hits counted per individual IO as bcache sees them.
# TYPE node_bcache_cache_hits_total counter
node_bcache_cache_hits_total{backing_device="bdev0",uuid="deaddd54-c735-46d5-868e-f331c5fd7c74"} 546
# HELP node_bcache_cache_miss_collisions_total Instances where data insertion from cache miss raced with write (data already present).
# TYPE node_bcache_cache_miss_collisions_total counter
node_bcache_cache_miss_collisions_total{backing_device="bdev0",uuid="deaddd54-c735-46d5-868e-f331c5fd7c74"} 0
# HELP node_bcache_cache_misses_total Misses counted per individual IO as bcache sees them.
# TYPE node_bcache_cache_misses_total counter
node_bcache_cache_misses_total{backing_device="bdev0",uuid="deaddd54-c735-46d5-868e-f331c5fd7c74"} 0
# HELP node_bcache_cache_read_races Counts instances where while data was being read from the cache, the bucket was reused and invalidated - i.e. where the pointer was stale after the read completed.
# TYPE node_bcache_cache_read_races counter
node_bcache_cache_read_races{uuid="deaddd54-c735-46d5-868e-f331c5fd7c74"} 0
# HELP node_bcache_cache_readaheads_total Count of times readahead occurred.
# TYPE node_bcache_cache_readaheads_total counter
node_bcache_cache_readaheads_total{backing_device="bdev0",uuid="deaddd54-c735-46d5-868e-f331c5fd7c74"} 0
# HELP node_bcache_congested Congestion.
# TYPE node_bcache_congested gauge
node_bcache_congested{uuid="deaddd54-c735-46d5-868e-f331c5fd7c74"} 0
# HELP node_bcache_dirty_data_bytes Amount of dirty data for this backing device in the cache.
# TYPE node_bcache_dirty_data_bytes gauge
node_bcache_dirty_data_bytes{backing_device="bdev0",uuid="deaddd54-c735-46d5-868e-f331c5fd7c74"} 0
# HELP node_bcache_io_errors Number of errors that have occurred, decayed by io_error_halflife.
# TYPE node_bcache_io_errors gauge
node_bcache_io_errors{cache_device="cache0",uuid="deaddd54-c735-46d5-868e-f331c5fd7c74"} 0
# HELP node_bcache_metadata_written_bytes_total Sum of all non data writes (btree writes and all other metadata).
# TYPE node_bcache_metadata_written_bytes_total counter
node_bcache_metadata_written_bytes_total{cache_device="cache0",uuid="deaddd54-c735-46d5-868e-f331c5fd7c74"} 512
# HELP node_bcache_priority_stats_metadata_percent Bcache's metadata overhead.
# TYPE node_bcache_priority_stats_metadata_percent gauge
node_bcache_priority_stats_metadata_percent{cache_device="cache0",uuid="deaddd54-c735-46d5-868e-f331c5fd7c74"} 0
# HELP node_bcache_priority_stats_unused_percent The percentage of the cache that doesn't contain any data.
# TYPE node_bcache_priority_stats_unused_percent gauge
node_bcache_priority_stats_unused_percent{cache_device="cache0",uuid="deaddd54-c735-46d5-868e-f331c5fd7c74"} 99
# HELP node_bcache_root_usage_percent Percentage of the root btree node in use (tree depth increases if too high).
# TYPE node_bcache_root_usage_percent gauge
node_bcache_root_usage_percent{uuid="deaddd54-c735-46d5-868e-f331c5fd7c74"} 0
# HELP node_bcache_tree_depth Depth of the btree.
# TYPE node_bcache_tree_depth gauge
node_bcache_tree_depth{uuid="deaddd54-c735-46d5-868e-f331c5fd7c74"} 0
# HELP node_bcache_written_bytes_total Sum of all data that has been written to the cache.
# TYPE node_bcache_written_bytes_total counter
node_bcache_written_bytes_total{cache_device="cache0",uuid="deaddd54-c735-46d5-868e-f331c5fd7c74"} 0
# HELP node_bonding_active Number of active slaves per bonding interface.
# TYPE node_bonding_active gauge
node_bonding_active{master="bond0"} 0

@@ -2164,6 +2233,7 @@ node_qdisc_requeues_total{device="wlan0",kind="fq"} 1
# HELP node_scrape_collector_success node_exporter: Whether a collector succeeded.
# TYPE node_scrape_collector_success gauge
node_scrape_collector_success{collector="arp"} 1
node_scrape_collector_success{collector="bcache"} 1
node_scrape_collector_success{collector="bonding"} 1
node_scrape_collector_success{collector="buddyinfo"} 1
node_scrape_collector_success{collector="conntrack"} 1

collector/fixtures/sys.ttar (new file, 1803 lines)

File diff suppressed because it is too large.

Deleted sysfs fixture files (their contents are now packed into collector/fixtures/sys.ttar); each was a one-line file:

@@ -1 +0,0 @@
-../../devices/platform/coretemp.0/hwmon/hwmon0

@@ -1 +0,0 @@
-../../devices/platform/coretemp.1/hwmon/hwmon1

@@ -1 +0,0 @@
-../../devices/platform/applesmc.768/hwmon/hwmon2

@@ -1 +0,0 @@
-../../devices/platform/nct6775.656/hwmon/hwmon3

@@ -1 +0,0 @@
-100000

@@ -1 +0,0 @@
-foosensor

@@ -1 +0,0 @@
-100000

@@ -1 +0,0 @@
-100000

@@ -1 +0,0 @@
-foosensor

@@ -1 +0,0 @@
-100000

@@ -1 +0,0 @@
-bond0 dmz int

@@ -1 +0,0 @@
-eth0 eth4

@@ -1 +0,0 @@
-eth5 eth1

@@ -1 +0,0 @@
-../../../applesmc.768

@@ -1 +0,0 @@
-../../../coretemp.0

Some files were not shown because too many files have changed in this diff.