prometheus

mirror of https://github.com/prometheus/prometheus.git synced 2025-03-05 20:59:13 -08:00

Corentin Chary 60dafd425c consul: improve consul service discovery (#3814)

* consul: improve consul service discovery

Related to #3711

- Add the ability to filter by tag and node-meta in an efficient way (`/catalog/services` allows filtering by node-meta, and returns a `map[string]string` or `service`->`tags`). Tags and node-meta are also used in `/catalog/service` requests.
- Do not require a call to the catalog if services are specified by name. This is important because on a large cluster `/catalog/services` changes all the time.
- Add `allow_stale` configuration option to do stale reads. Non-stale reads can be costly, even more so when you are doing them to a remote datacenter with 10k+ targets over WAN (which is common for federation).
- Add `refresh_interval` to minimize the strain on the catalog and on the service endpoint. This is needed because of this kind of behavior from consul: https://github.com/hashicorp/consul/issues/3712 and because a catalog on a large cluster would basically change all the time. There is no need to discover targets within 1s if we scrape them every minute.
- Added plenty of unit tests.

Benchmarks
----------

```yaml
scrape_configs:
- job_name: prometheus
  scrape_interval: 60s
  static_configs:
  - targets: ["127.0.0.1:9090"]

- job_name: "observability-by-tag"
  scrape_interval: "60s"
  metrics_path: "/metrics"
  consul_sd_configs:
  - server: consul.service.par.consul.prod.crto.in:8500
    tag: marathon-user-observability  # Used in After
    refresh_interval: 30s             # Used in After+delay
  relabel_configs:
  - source_labels: [__meta_consul_tags]
    regex: ^(.*,)?marathon-user-observability(,.*)?$
    action: keep

- job_name: "observability-by-name"
  scrape_interval: "60s"
  metrics_path: "/metrics"
  consul_sd_configs:
  - server: consul.service.par.consul.prod.crto.in:8500
    services:
    - observability-cerebro
    - observability-portal-web

- job_name: "fake-fake-fake"
  scrape_interval: "15s"
  metrics_path: "/metrics"
  consul_sd_configs:
  - server: consul.service.par.consul.prod.crto.in:8500
    services:
    - fake-fake-fake
```

Note: tested with ~1200 services, ~5000 nodes.

| Resource | Empty | Before | After | After + delay |
| -------- |:-----:|:------:|:-----:|:-------------:|
|/service-discovery size|5K|85MiB|27k|27k|
|`go_memstats_heap_objects`|100k|1M|120k|110k|
|`go_memstats_heap_alloc_bytes`|24MB|150MB|28MB|27MB|
|`rate(go_memstats_alloc_bytes_total[5m])`|0.2MB/s|28MB/s|2MB/s|0.3MB/s|
|`rate(process_cpu_seconds_total[5m])`|0.1%|15%|2%|0.01%|
|`process_open_fds`|16|1236|22|22|
|`rate(prometheus_sd_consul_rpc_duration_seconds_count{call="services"}[5m])`|~0|1|1|0.03|
|`rate(prometheus_sd_consul_rpc_duration_seconds_count{call="service"}[5m])`|0.1|80|0.5|0.5|
|`prometheus_target_sync_length_seconds{quantile="0.9",scrape_job="observability-by-tag"}`|N/A|200ms|0.2ms|0.2ms|
|Network bandwidth|~10kbps|~2.8Mbps|~1.6Mbps|~10kbps|

Filtering by tag using relabel_configs uses 100KiB and 23KiB/s per service per job and quite a lot of CPU. It also sends an additional 1Mbps of traffic to consul.
Being a little bit smarter about this reduces the overhead quite a lot.
Limiting the number of `/catalog/services` queries per second almost removes the overhead of service discovery.

* consul: tweak `refresh_interval` behavior

`refresh_interval` now does what is advertised in the documentation: there won't be more than one update per `refresh_interval`. It now defaults to 30s (which was also the current waitTime in the consul query).

This also makes sure we don't wait another 30s if we already waited 29s in the blocking call, by subtracting the number of elapsed seconds.

Hopefully this will do what people expect and will be safer for existing consul infrastructures.

2018-03-23 14:48:43 +00:00
..
bearertoken.bad.yml	Configuration options for bearer tokens, client certs & CA certs	2015-08-04 17:18:46 +01:00
bearertoken_basicauth.bad.yml	Configuration options for bearer tokens, client certs & CA certs	2015-08-04 17:18:46 +01:00
conf.good.yml	consul: improve consul service discovery (#3814)	2018-03-23 14:48:43 +00:00
first.rules	Fix regression of alert rules state loss on config reload. (#3382)	2017-11-01 12:58:00 +01:00
global_timeout.good.yml	Fix global config YAML issues	2016-02-15 14:08:25 +01:00
jobname.bad.yml	Allow number to be the first letter as well for `job_name`	2016-09-16 14:06:47 +03:00
jobname_dup.bad.yml	Switch config to YAML format.	2015-05-07 16:52:14 +02:00
kubernetes_bearertoken.bad.yml	config: adapt unit tests	2016-10-17 10:32:10 +02:00
kubernetes_bearertoken_basicauth.bad.yml	config: adapt unit tests	2016-10-17 10:32:10 +02:00
kubernetes_namespace_discovery.bad.yml	Allow limiting Kubernetes service discover to certain namespaces	2017-04-27 07:41:36 -04:00
kubernetes_role.bad.yml	config: validate Kubernetes role correctly.	2016-07-18 22:24:41 +09:00
labeldrop.bad.yml	Stricter Relabel Config Checking for Labeldrop/keep (#2510)	2017-03-18 22:32:08 +01:00
labeldrop2.bad.yml	Stricter Relabel Config Checking for Labeldrop/keep (#2510)	2017-03-18 22:32:08 +01:00
labeldrop3.bad.yml	Stricter Relabel Config Checking for Labeldrop/keep (#2510)	2017-03-18 22:32:08 +01:00
labeldrop4.bad.yml	Stricter Relabel Config Checking for Labeldrop/keep (#2510)	2017-03-18 22:32:08 +01:00
labeldrop5.bad.yml	Stricter Relabel Config Checking for Labeldrop/keep (#2510)	2017-03-18 22:32:08 +01:00
labelkeep.bad.yml	Stricter Relabel Config Checking for Labeldrop/keep (#2510)	2017-03-18 22:32:08 +01:00
labelkeep2.bad.yml	Stricter Relabel Config Checking for Labeldrop/keep (#2510)	2017-03-18 22:32:08 +01:00
labelkeep3.bad.yml	Stricter Relabel Config Checking for Labeldrop/keep (#2510)	2017-03-18 22:32:08 +01:00
labelkeep4.bad.yml	Stricter Relabel Config Checking for Labeldrop/keep (#2510)	2017-03-18 22:32:08 +01:00
labelkeep5.bad.yml	Stricter Relabel Config Checking for Labeldrop/keep (#2510)	2017-03-18 22:32:08 +01:00
labelmap.bad.yml	Prevent invalid label names with labelmap (#3868)	2018-02-21 10:02:22 +00:00
labelname.bad.yml	Rename global "labels" config option to "external_labels".	2015-09-29 20:54:20 +02:00
labelname2.bad.yml	Rename global "labels" config option to "external_labels".	2015-09-29 20:54:20 +02:00
marathon_no_servers.bad.yml	Fix missing unmarshal for Marathon SD config.	2015-09-06 20:02:22 +02:00
modulus_missing.bad.yml	Add 'hashmod' relabel action.	2015-06-24 21:14:53 +01:00
regex.bad.yml	Switch config to YAML format.	2015-05-07 16:52:14 +02:00
remote_read_url_missing.bad.yml	Make sure that url for remote_read/write is not nil (#3024)	2017-08-07 08:49:45 +01:00
remote_write_url_missing.bad.yml	Make sure that url for remote_read/write is not nil (#3024)	2017-08-07 08:49:45 +01:00
rules.bad.yml	Load rule files from entire directories	2015-06-01 21:12:31 +02:00
rules_abs_path.good.yml	Fixing tests for Windows	2017-07-09 01:59:30 -03:00
rules_abs_path_windows.good.yml	Fixing tests for Windows	2017-07-09 01:59:30 -03:00
scrape_interval.bad.yml	Restrict scrape timeout to interval length	2016-02-12 12:52:22 +01:00
static_config.bad.json	config: deprecate `target_groups` for `static_configs`	2016-06-08 15:55:25 +02:00
target_label_hashmod_missing.bad.yml	Forbid invalid relabel configurations	2016-08-29 16:56:06 +02:00
target_label_missing.bad.yml	Forbid invalid relabel configurations	2016-08-29 16:56:06 +02:00
unknown_attr.bad.yml	Rename global "labels" config option to "external_labels".	2015-09-29 20:54:20 +02:00
unknown_global_attr.bad.yml	config: Fix overflow checking in global config (#2783)	2017-05-30 20:58:06 +02:00
url_in_targetgroup.bad.yml	config: deprecate `target_groups` for `static_configs`	2016-06-08 15:55:25 +02:00