prometheus/config/testdata
Corentin Chary 60dafd425c consul: improve consul service discovery (#3814)
* consul: improve consul service discovery

Related to #3711

- Add the ability to filter by tag and node-meta in an efficient way (`/catalog/services`
  allows filtering by node-meta, and returns a `map[string][]string` of `service`->`tags`).
  Tags and node-meta are also used in `/catalog/service` requests.
- Do not require a call to the catalog if services are specified by name. This is important
  because on a large cluster `/catalog/services` changes all the time.
- Add `allow_stale` configuration option to do stale reads. Non-stale
  reads can be costly, even more when you are doing them to a remote
  datacenter with 10k+ targets over WAN (which is common for federation).
- Add `refresh_interval` to minimize the strain on the catalog and on the
  service endpoint. This is needed because of that kind of behavior from
  consul: https://github.com/hashicorp/consul/issues/3712 and because a catalog
  on a large cluster would basically change *all* the time. No need to discover
  targets in 1sec if we scrape them every minute.
- Added plenty of unit tests.

Benchmarks
----------

```yaml
scrape_configs:

- job_name: prometheus
  scrape_interval: 60s
  static_configs:
    - targets: ["127.0.0.1:9090"]

- job_name: "observability-by-tag"
  scrape_interval: "60s"
  metrics_path: "/metrics"
  consul_sd_configs:
    - server: consul.service.par.consul.prod.crto.in:8500
      tag: marathon-user-observability  # Used in After
      refresh_interval: 30s             # Used in After+delay
  relabel_configs:
    - source_labels: [__meta_consul_tags]
      regex: ^(.*,)?marathon-user-observability(,.*)?$
      action: keep

- job_name: "observability-by-name"
  scrape_interval: "60s"
  metrics_path: "/metrics"
  consul_sd_configs:
    - server: consul.service.par.consul.prod.crto.in:8500
      services:
        - observability-cerebro
        - observability-portal-web

- job_name: "fake-fake-fake"
  scrape_interval: "15s"
  metrics_path: "/metrics"
  consul_sd_configs:
    - server: consul.service.par.consul.prod.crto.in:8500
      services:
        - fake-fake-fake
```

Note: tested with ~1200 services, ~5000 nodes.

| Resource | Empty | Before | After | After + delay |
| -------- |:-----:|:------:|:-----:|:-------------:|
|/service-discovery size|5K|85MiB|27k|27k|
|`go_memstats_heap_objects`|100k|1M|120k|110k|
|`go_memstats_heap_alloc_bytes`|24MB|150MB|28MB|27MB|
|`rate(go_memstats_alloc_bytes_total[5m])`|0.2MB/s|28MB/s|2MB/s|0.3MB/s|
|`rate(process_cpu_seconds_total[5m])`|0.1%|15%|2%|0.01%|
|`process_open_fds`|16|*1236*|22|22|
|`rate(prometheus_sd_consul_rpc_duration_seconds_count{call="services"}[5m])`|~0|1|1|*0.03*|
|`rate(prometheus_sd_consul_rpc_duration_seconds_count{call="service"}[5m])`|0.1|*80*|0.5|0.5|
|`prometheus_target_sync_length_seconds{quantile="0.9",scrape_job="observability-by-tag"}`|N/A|200ms|0.2ms|0.2ms|
|Network bandwidth|~10kbps|~2.8Mbps|~1.6Mbps|~10kbps|

Filtering by tag using `relabel_configs` uses **100kiB and 23kiB/s per service per job** and quite a lot of CPU. It also sends an additional *1Mbps* of traffic to consul.
Being a little bit smarter about this reduces the overhead quite a lot.
Limiting the number of `/catalog/services` queries per second almost removes the overhead of service discovery.
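
To illustrate why the relabel-based filter is expensive, here is a minimal sketch of the client-side check that the `keep` rule in the `observability-by-tag` job performs. It assumes the `__meta_consul_tags` label holds the service tags joined by commas; the function name `keepByTag` is hypothetical and is not part of Prometheus. Every discovered service has to be fetched and turned into a target before this check can run, whereas the `tag` option lets consul do the filtering.

```go
package main

import (
	"fmt"
	"regexp"
)

// keepRe is the regex from the relabel_configs entry in the benchmark config.
var keepRe = regexp.MustCompile(`^(.*,)?marathon-user-observability(,.*)?$`)

// keepByTag reports whether a comma-joined tag list contains the exact tag.
// Hypothetical helper: it only mimics what a `keep` relabel action decides.
func keepByTag(joinedTags string) bool {
	return keepRe.MatchString(joinedTags)
}

func main() {
	for _, tags := range []string{
		"marathon-user-observability",
		"http,marathon-user-observability,prod",
		"marathon-user-observability-canary",
	} {
		fmt.Printf("%q keep=%v\n", tags, keepByTag(tags))
	}
}
```

Only the first two tag lists are kept; the third is dropped because the regex requires the tag to be a whole comma-delimited element.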

* consul: tweak `refresh_interval` behavior

`refresh_interval` now does what is advertised in the documentation:
there won't be more than one update per `refresh_interval`. It now
defaults to 30s (which was also the current waitTime in the consul query).

This also makes sure we don't wait another 30s if we already waited 29s
in the blocking call, by subtracting the number of elapsed seconds.

Hopefully this will do what people expect and will be safer
for existing consul infrastructures.
2018-03-23 14:48:43 +00:00
bearertoken.bad.yml Configuration options for bearer tokens, client certs & CA certs 2015-08-04 17:18:46 +01:00
bearertoken_basicauth.bad.yml Configuration options for bearer tokens, client certs & CA certs 2015-08-04 17:18:46 +01:00
conf.good.yml consul: improve consul service discovery (#3814) 2018-03-23 14:48:43 +00:00
first.rules Fix regression of alert rules state loss on config reload. (#3382) 2017-11-01 12:58:00 +01:00
global_timeout.good.yml Fix global config YAML issues 2016-02-15 14:08:25 +01:00
jobname.bad.yml Allow number to be the first letter as well for job_name 2016-09-16 14:06:47 +03:00
jobname_dup.bad.yml Switch config to YAML format. 2015-05-07 16:52:14 +02:00
kubernetes_bearertoken.bad.yml config: adapt unit tests 2016-10-17 10:32:10 +02:00
kubernetes_bearertoken_basicauth.bad.yml config: adapt unit tests 2016-10-17 10:32:10 +02:00
kubernetes_namespace_discovery.bad.yml Allow limiting Kubernetes service discover to certain namespaces 2017-04-27 07:41:36 -04:00
kubernetes_role.bad.yml config: validate Kubernetes role correctly. 2016-07-18 22:24:41 +09:00
labeldrop.bad.yml Stricter Relabel Config Checking for Labeldrop/keep (#2510) 2017-03-18 22:32:08 +01:00
labeldrop2.bad.yml Stricter Relabel Config Checking for Labeldrop/keep (#2510) 2017-03-18 22:32:08 +01:00
labeldrop3.bad.yml Stricter Relabel Config Checking for Labeldrop/keep (#2510) 2017-03-18 22:32:08 +01:00
labeldrop4.bad.yml Stricter Relabel Config Checking for Labeldrop/keep (#2510) 2017-03-18 22:32:08 +01:00
labeldrop5.bad.yml Stricter Relabel Config Checking for Labeldrop/keep (#2510) 2017-03-18 22:32:08 +01:00
labelkeep.bad.yml Stricter Relabel Config Checking for Labeldrop/keep (#2510) 2017-03-18 22:32:08 +01:00
labelkeep2.bad.yml Stricter Relabel Config Checking for Labeldrop/keep (#2510) 2017-03-18 22:32:08 +01:00
labelkeep3.bad.yml Stricter Relabel Config Checking for Labeldrop/keep (#2510) 2017-03-18 22:32:08 +01:00
labelkeep4.bad.yml Stricter Relabel Config Checking for Labeldrop/keep (#2510) 2017-03-18 22:32:08 +01:00
labelkeep5.bad.yml Stricter Relabel Config Checking for Labeldrop/keep (#2510) 2017-03-18 22:32:08 +01:00
labelmap.bad.yml Prevent invalid label names with labelmap (#3868) 2018-02-21 10:02:22 +00:00
labelname.bad.yml Rename global "labels" config option to "external_labels". 2015-09-29 20:54:20 +02:00
labelname2.bad.yml Rename global "labels" config option to "external_labels". 2015-09-29 20:54:20 +02:00
marathon_no_servers.bad.yml Fix missing unmarshal for Marathon SD config. 2015-09-06 20:02:22 +02:00
modulus_missing.bad.yml Add 'hashmod' relabel action. 2015-06-24 21:14:53 +01:00
regex.bad.yml Switch config to YAML format. 2015-05-07 16:52:14 +02:00
remote_read_url_missing.bad.yml Make sure that url for remote_read/write is not nil (#3024) 2017-08-07 08:49:45 +01:00
remote_write_url_missing.bad.yml Make sure that url for remote_read/write is not nil (#3024) 2017-08-07 08:49:45 +01:00
rules.bad.yml Load rule files from entire directories 2015-06-01 21:12:31 +02:00
rules_abs_path.good.yml Fixing tests for Windows 2017-07-09 01:59:30 -03:00
rules_abs_path_windows.good.yml Fixing tests for Windows 2017-07-09 01:59:30 -03:00
scrape_interval.bad.yml Restrict scrape timeout to interval length 2016-02-12 12:52:22 +01:00
static_config.bad.json config: deprecate target_groups for static_configs 2016-06-08 15:55:25 +02:00
target_label_hashmod_missing.bad.yml Forbid invalid relabel configurations 2016-08-29 16:56:06 +02:00
target_label_missing.bad.yml Forbid invalid relabel configurations 2016-08-29 16:56:06 +02:00
unknown_attr.bad.yml Rename global "labels" config option to "external_labels". 2015-09-29 20:54:20 +02:00
unknown_global_attr.bad.yml config: Fix overflow checking in global config (#2783) 2017-05-30 20:58:06 +02:00
url_in_targetgroup.bad.yml config: deprecate target_groups for static_configs 2016-06-08 15:55:25 +02:00