* consul: improve consul service discovery
Related to #3711
- Add the ability to filter by tag and node-meta in an efficient way (`/catalog/services`
allow filtering by node-meta, and returns a `map[string]string` or `service`->`tags`).
Tags and nore-meta are also used in `/catalog/service` requests.
- Do not require a call to the catalog if services are specified by name. This is important
because on large cluster `/catalog/services` changes all the time.
- Add `allow_stale` configuration option to do stale reads. Non-stale
reads can be costly, even more when you are doing them to a remote
datacenter with 10k+ targets over WAN (which is common for federation).
- Add `refresh_interval` to minimize the strain on the catalog and on the
service endpoint. This is needed because of that kind of behavior from
consul: https://github.com/hashicorp/consul/issues/3712 and because a catalog
on a large cluster would basically change *all* the time. No need to discover
targets in 1sec if we scrape them every minute.
- Added plenty of unit tests.
Benchmarks
----------
```yaml
scrape_configs:
- job_name: prometheus
scrape_interval: 60s
static_configs:
- targets: ["127.0.0.1:9090"]
- job_name: "observability-by-tag"
scrape_interval: "60s"
metrics_path: "/metrics"
consul_sd_configs:
- server: consul.service.par.consul.prod.crto.in:8500
tag: marathon-user-observability # Used in After
refresh_interval: 30s # Used in After+delay
relabel_configs:
- source_labels: [__meta_consul_tags]
regex: ^(.*,)?marathon-user-observability(,.*)?$
action: keep
- job_name: "observability-by-name"
scrape_interval: "60s"
metrics_path: "/metrics"
consul_sd_configs:
- server: consul.service.par.consul.prod.crto.in:8500
services:
- observability-cerebro
- observability-portal-web
- job_name: "fake-fake-fake"
scrape_interval: "15s"
metrics_path: "/metrics"
consul_sd_configs:
- server: consul.service.par.consul.prod.crto.in:8500
services:
- fake-fake-fake
```
Note: tested with ~1200 services, ~5000 nodes.
| Resource | Empty | Before | After | After + delay |
| -------- |:-----:|:------:|:-----:|:-------------:|
|/service-discovery size|5K|85MiB|27k|27k|27k|
|`go_memstats_heap_objects`|100k|1M|120k|110k|
|`go_memstats_heap_alloc_bytes`|24MB|150MB|28MB|27MB|
|`rate(go_memstats_alloc_bytes_total[5m])`|0.2MB/s|28MB/s|2MB/s|0.3MB/s|
|`rate(process_cpu_seconds_total[5m])`|0.1%|15%|2%|0.01%|
|`process_open_fds`|16|*1236*|22|22|
|`rate(prometheus_sd_consul_rpc_duration_seconds_count{call="services"}[5m])`|~0|1|1|*0.03*|
|`rate(prometheus_sd_consul_rpc_duration_seconds_count{call="service"}[5m])`|0.1|*80*|0.5|0.5|
|`prometheus_target_sync_length_seconds{quantile="0.9",scrape_job="observability-by-tag"}`|N/A|200ms|0.2ms|0.2ms|
|Network bandwidth|~10kbps|~2.8Mbps|~1.6Mbps|~10kbps|
Filtering by tag using relabel_configs uses **100kiB and 23kiB/s per service per job** and quite a lot of CPU. Also sends and additional *1Mbps* of traffic to consul.
Being a little bit smarter about this reduces the overhead quite a lot.
Limiting the number of `/catalog/services` queries per second almost removes the overhead of service discovery.
* consul: tweak `refresh_interval` behavior
`refresh_interval` now does what is advertised in the documentation,
there won't be more that one update per `refresh_interval`. It now
defaults to 30s (which was also the current waitTime in the consul query).
This also make sure we don't wait another 30s if we already waited 29s
in the blocking call by substracting the number of elapsed seconds.
Hopefully this will do what people expect it does and will be safer
for existing consul infrastructures.
There is currently no way to differentiate Windows instances from Linux
ones. This is needed when you have a mix of node_exporters /
wmi_exporters for OS-level metrics and you want to have them in separate
scrape jobs.
This change allows you to do just that. Example:
```
- job_name: 'node'
azure_sd_configs:
- <azure_sd_config>
relabel_configs:
- source_labels: [__meta_azure_machine_os_type]
regex: Linux
action: keep
```
The way the vendor'd AzureSDK provides to get the OsType is a bit
awkward - as far as I can tell, this information can only be gotten from
the startup disk. Newer versions of the SDK appear to improve this a
bit (by having OS information in the InstanceView), but the current way
still works.
The docs suggest that alert templating only works in the summary and
description annotation fields. Some testing and a review of the code
suggests this is no longer true and that you can template any
annotation field.
As discussed generally consider SDs as unstable, as realistically they
are never going to be. Drop the words "experimental/beta" from most
places in the docs, as users are getting the wrong impression from this.
For special remote read endpoints which have only data for specific
queries, it is desired to limit the number of queries sent to the
configured remote read endpoint to reduce latency and performance
overhead.
The only section that still aplies was the one on the default storage
directory so those docs seem obsolete.
We'll probably have a similar page on the new storage but we'll only
find out what caveats etc. we'll have to point out as we get people
reporting problems or notable behavior.
Go automatically configures the number of used threads appropriately
and tweaking it is no longer relevant for a basic setup of Prometheus.
The baseline consumption tied to the storage layer no longer applies.
In order to provide documentation for each individual version, this
commit starts moving Prometheus server specific documentation into the
repository itself.