Commit graph

390 commits

Author SHA1 Message Date
Jannick Fahlbusch ฏ๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎ 0be25f92e2 EC2 Discovery: Allow to set a custom endpoint (#4333)
Allowing to set a custom endpoint makes it easy to monitor targets on non AWS providers with EC2 compliant APIs.

Signed-off-by: Jannick Fahlbusch <git@jf-projects.de>
2018-07-18 10:48:14 +01:00
Romain Baugue b41be4ef52 Discovery consul service meta (#4280)
* Upgrade Consul client
* Add ServiceMeta to the labels in ConsulSD

Signed-off-by: Romain Baugue <romain.baugue@elwinar.com>
2018-07-18 05:06:56 +01:00
Simon Pasquier ed99af0b05 docs: fix OpenStack SD for the hypervisor role
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-07-15 12:37:57 +01:00
Marcin Owsiany 9fe8bcf4be Fix markup in example. (#4351)
Signed-off-by: Marcin Owsiany <marcin@owsiany.pl>
2018-07-05 09:13:00 +01:00
Damien Lespiau e64037053d Expose controller kind and name to labelling rules
Relabelling rules can use this information to attach the name of the controller
that has created a pod.

In turn, this can be used to slice metrics by workload at query time, ie.
"Give me all metrics that have been created by the $name Deployment"

Signed-off-by: Damien Lespiau <damien@weave.works>
2018-05-09 11:51:37 +02:00
Nathan Graves 5b27996cb3 Include GCE labels during service discovery. Updated vendor files for Google API. (#4150)
Signed-off-by: Nathan Graves <nathan.graves@kofile.us>
2018-05-08 17:37:47 +01:00
Daisy T b424eb42e3 document remote write queue parameters (#4126) 2018-04-30 20:08:45 +02:00
Brian Brazil fbe66819c5
Update ALERTS docs for 2.0 staleness changes. (#4116)
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
2018-04-26 12:44:11 +01:00
Adam Shannon 809881d7f5 support reading basic_auth password_file for HTTP basic auth (#4077)
Issue: https://github.com/prometheus/prometheus/issues/4076

Signed-off-by: Adam Shannon <adamkshannon@gmail.com>
2018-04-25 18:19:06 +01:00
Philippe Laflamme 2aba238f31 Use common HTTPClientConfig for marathon_sd configuration (#4009)
This adds support for basic authentication which closes #3090

The support for specifying the client timeout was removed as discussed in https://github.com/prometheus/common/pull/123. Marathon was the only sd mechanism doing this and configuring the timeout is done through `Context`.

DC/OS uses a custom `Authorization` header for authenticating. This adds 2 new configuration properties to reflect this.

Existing configuration files that use the bearer token will no longer work. More work is required to make this backwards compatible.
2018-04-05 09:08:18 +01:00
albatross0 0245fd55bf Add a machine type label to GCE SD (#4032) 2018-03-31 09:20:19 +01:00
Kristiyan Nikolov be85ba3842 discovery/ec2: Support filtering instances in discovery (#4011) 2018-03-31 07:51:11 +01:00
Corentin Chary 60dafd425c consul: improve consul service discovery (#3814)
* consul: improve consul service discovery

Related to #3711

- Add the ability to filter by tag and node-meta in an efficient way (`/catalog/services`
  allow filtering by node-meta, and returns a `map[string]string` or `service`->`tags`).
  Tags and nore-meta are also used in `/catalog/service` requests.
- Do not require a call to the catalog if services are specified by name. This is important
  because on large cluster `/catalog/services` changes all the time.
- Add `allow_stale` configuration option to do stale reads. Non-stale
  reads can be costly, even more when you are doing them to a remote
  datacenter with 10k+ targets over WAN (which is common for federation).
- Add `refresh_interval` to minimize the strain on the catalog and on the
  service endpoint. This is needed because of that kind of behavior from
  consul: https://github.com/hashicorp/consul/issues/3712 and because a catalog
  on a large cluster would basically change *all* the time. No need to discover
  targets in 1sec if we scrape them every minute.
- Added plenty of unit tests.

Benchmarks
----------

```yaml
scrape_configs:

- job_name: prometheus
  scrape_interval: 60s
  static_configs:
    - targets: ["127.0.0.1:9090"]

- job_name: "observability-by-tag"
  scrape_interval: "60s"
  metrics_path: "/metrics"
  consul_sd_configs:
    - server: consul.service.par.consul.prod.crto.in:8500
      tag: marathon-user-observability  # Used in After
      refresh_interval: 30s             # Used in After+delay
  relabel_configs:
    - source_labels: [__meta_consul_tags]
      regex: ^(.*,)?marathon-user-observability(,.*)?$
      action: keep

- job_name: "observability-by-name"
  scrape_interval: "60s"
  metrics_path: "/metrics"
  consul_sd_configs:
    - server: consul.service.par.consul.prod.crto.in:8500
      services:
        - observability-cerebro
        - observability-portal-web

- job_name: "fake-fake-fake"
  scrape_interval: "15s"
  metrics_path: "/metrics"
  consul_sd_configs:
    - server: consul.service.par.consul.prod.crto.in:8500
      services:
        - fake-fake-fake
```

Note: tested with ~1200 services, ~5000 nodes.

| Resource | Empty | Before | After | After + delay |
| -------- |:-----:|:------:|:-----:|:-------------:|
|/service-discovery size|5K|85MiB|27k|27k|27k|
|`go_memstats_heap_objects`|100k|1M|120k|110k|
|`go_memstats_heap_alloc_bytes`|24MB|150MB|28MB|27MB|
|`rate(go_memstats_alloc_bytes_total[5m])`|0.2MB/s|28MB/s|2MB/s|0.3MB/s|
|`rate(process_cpu_seconds_total[5m])`|0.1%|15%|2%|0.01%|
|`process_open_fds`|16|*1236*|22|22|
|`rate(prometheus_sd_consul_rpc_duration_seconds_count{call="services"}[5m])`|~0|1|1|*0.03*|
|`rate(prometheus_sd_consul_rpc_duration_seconds_count{call="service"}[5m])`|0.1|*80*|0.5|0.5|
|`prometheus_target_sync_length_seconds{quantile="0.9",scrape_job="observability-by-tag"}`|N/A|200ms|0.2ms|0.2ms|
|Network bandwidth|~10kbps|~2.8Mbps|~1.6Mbps|~10kbps|

Filtering by tag using relabel_configs uses **100kiB and 23kiB/s per service per job** and quite a lot of CPU. Also sends and additional *1Mbps* of traffic to consul.
Being a little bit smarter about this reduces the overhead quite a lot.
Limiting the number of `/catalog/services` queries per second almost removes the overhead of service discovery.

* consul: tweak `refresh_interval` behavior

`refresh_interval` now does what is advertised in the documentation,
there won't be more that one update per `refresh_interval`. It now
defaults to 30s (which was also the current waitTime in the consul query).

This also make sure we don't wait another 30s if we already waited 29s
in the blocking call by substracting the number of elapsed seconds.

Hopefully this will do what people expect it does and will be safer
for existing consul infrastructures.
2018-03-23 14:48:43 +00:00
Yecheng Fu 56ed29fbf7 Map target infos of endpoints to prometheus meta labels. (#3770) 2018-03-09 10:07:00 +00:00
Jeffrey Zhang 21f96caab3 Fix wrong syntax for alert field templates (#3883) 2018-02-24 09:37:43 +00:00
Pedro Araújo 575f665944 Add OS type meta label to Azure SD (#3863)
There is currently no way to differentiate Windows instances from Linux
ones. This is needed when you have a mix of node_exporters /
wmi_exporters for OS-level metrics and you want to have them in separate
scrape jobs.

This change allows you to do just that. Example:

```
  - job_name: 'node'
    azure_sd_configs:
      - <azure_sd_config>
    relabel_configs:
      - source_labels: [__meta_azure_machine_os_type]
        regex: Linux
        action: keep
```

The way the vendor'd AzureSDK provides to get the OsType is a bit
awkward - as far as I can tell, this information can only be gotten from
the startup disk. Newer versions of the SDK appear to improve this a
bit (by having OS information in the InstanceView), but the current way
still works.
2018-02-19 15:40:57 +00:00
Andrea Giardini 3a9637fa3c docs: Fix remote_read/remote_timeout default (#3829) 2018-02-12 12:52:33 +00:00
zemek 8a01a0fbed Set consul server default to localhost:8500 (#3703) 2018-01-24 12:14:32 +00:00
James Turnbull 00f4821178 Added missing ingress from role list (#3666) 2018-01-08 21:23:01 +00:00
Brian Brazil fba80da635
Fix default of read_recent to be false. (#3617)
This is what is documented in the migration guide, and the default settings
should make sense for a true long term storage.

Document the setting.
2017-12-23 17:21:38 +00:00
James Turnbull c3f9238756 Updated alert templating docs (#3596)
The docs suggest that alert templating only works in the summary and
description annotation fields. Some testing and a review of the code
suggests this is no longer true and that you can template any
annotation field.
2017-12-19 08:04:06 +00:00
Brian Brazil 9083d41d3a
Add 2.0 stability guarantees (#3484)
As discussed generally consider SDs as unstable, as realistically they
are never going to be. Drop the words "experimental/beta" from most
places in the docs, as users are getting the wrong impression from this.
2017-12-14 12:54:32 +00:00
Simon Pasquier aa25dff1ea Update the openstack_sd_config section
openstack_sd_config requires a 'role' parameter which wasn't documented.
2017-12-14 12:20:28 +00:00
vthriller b4bd91958a [minor] docs: recording_rules: fix missing key 2017-12-14 12:20:28 +00:00
Tobias Schmidt 28205f5ca9 Remove wrong statement about alertmanager URL configuration 2017-12-14 12:20:28 +00:00
Brian Brazil e0711c2e9b Document consul sd tls_config (#3440)
Fixes https://github.com/prometheus/docs/issues/681
2017-12-14 12:20:28 +00:00
phyber 013dc30dee Fix markdown in recording rules. (#3432)
Resolves an issue where rendered markdown was incorrect.
2017-12-14 12:20:28 +00:00
James Turnbull 330735aca6 Added another full link to the configuration docs (#3553) 2017-12-07 08:31:15 +00:00
Amy Holt 607a675617 Add prefix to relative 3 URLs (#3551) 2017-12-06 21:16:53 +00:00
James Turnbull 47311bf005 Update configuration.md (#3513)
1. Removed https://prometheus.io prefix 
2. Fixed broken file discovery link.
2017-11-27 14:52:32 +00:00
Tom Wilkie 7d4f7c4b71 Update docs for __meta_kubernetes_pod_uid 2017-11-24 15:02:53 +00:00
Tobias Schmidt 7098c56474 Add remote read filter option
For special remote read endpoints which have only data for specific
queries, it is desired to limit the number of queries sent to the
configured remote read endpoint to reduce latency and performance
overhead.
2017-11-13 23:30:01 +01:00
Brian Brazil a5b7955ace Tweak marathon wording around clustering. 2017-11-02 13:03:19 +00:00
Goutham Veeramachaneni 646e33242e docs: Fix minor issues with the docs. (#3389)
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-11-01 15:35:50 +00:00
Brian Brazil b6494960d1
docs: Document new recording rule format (#3378) 2017-11-01 12:58:32 +00:00
James Turnbull 3701a827cf Updates to alerting rules docs (#3381)
1. Added a further explanation of the for clause.
2. Added further clarification of non identifying labels.
2017-10-31 19:19:17 +00:00
Brian Brazil 8cf279efb1 Document new alerting rule format. 2017-10-31 14:46:34 +00:00
Fabian Reinartz 8a2b5a3936 docs: update flags to new double-dash syntax 2017-10-28 12:08:33 +02:00
Tobias Schmidt f49ae044d7 Import template reference and examples 2017-10-27 16:08:38 +02:00
Tobias Schmidt f432b8176d Consolidate configuration and rules docs in docs/configuration/ 2017-10-27 09:54:02 +02:00