Commit graph

621 commits

Author SHA1 Message Date
Corentin Chary 60dafd425c consul: improve consul service discovery (#3814)
* consul: improve consul service discovery

Related to #3711

- Add the ability to filter by tag and node-meta in an efficient way (`/catalog/services`
  allows filtering by node-meta, and returns a `map[string][]string` of `service`->`tags`).
  Tags and node-meta are also used in `/catalog/service` requests.
- Do not require a call to the catalog if services are specified by name. This is important
  because on a large cluster `/catalog/services` changes all the time.
- Add `allow_stale` configuration option to do stale reads. Non-stale
  reads can be costly, even more when you are doing them to a remote
  datacenter with 10k+ targets over WAN (which is common for federation).
- Add `refresh_interval` to minimize the strain on the catalog and on the
  service endpoint. This is needed because of that kind of behavior from
  consul: https://github.com/hashicorp/consul/issues/3712 and because a catalog
  on a large cluster would basically change *all* the time. No need to discover
  targets in 1sec if we scrape them every minute.
- Added plenty of unit tests.
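
As a rough illustration of the filtering described in the list above, the sketch below builds a `/catalog/service` query that carries a tag, a node-meta filter and the stale flag. The helper name `catalogServiceURL` and its signature are hypothetical and are not taken from the Prometheus code. The `tag`, `node-meta` and `stale` query parameters are assumed from Consul's HTTP API.

```go
package main

import (
	"fmt"
	"net/url"
)

// catalogServiceURL is a hypothetical helper that builds a Consul catalog
// query for a single service. Server-side filtering by tag and node-meta
// keeps the response small; stale lets any server answer the read.
func catalogServiceURL(server, service, tag string, nodeMeta map[string]string, allowStale bool) string {
	q := url.Values{}
	if tag != "" {
		q.Set("tag", tag)
	}
	for k, v := range nodeMeta {
		// Assumed Consul format: node-meta=key:value, repeatable.
		q.Add("node-meta", k+":"+v)
	}
	if allowStale {
		q.Set("stale", "")
	}
	u := url.URL{
		Scheme:   "http",
		Host:     server,
		Path:     "/v1/catalog/service/" + service,
		RawQuery: q.Encode(), // Encode sorts parameters by key.
	}
	return u.String()
}

func main() {
	fmt.Println(catalogServiceURL(
		"localhost:8500", "web", "prod",
		map[string]string{"rack": "r1"}, true,
	))
}
```

Because the service is addressed by name, no call to `/catalog/services` is needed in this path.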

Benchmarks
----------

```yaml
scrape_configs:

- job_name: prometheus
  scrape_interval: 60s
  static_configs:
    - targets: ["127.0.0.1:9090"]

- job_name: "observability-by-tag"
  scrape_interval: "60s"
  metrics_path: "/metrics"
  consul_sd_configs:
    - server: consul.service.par.consul.prod.crto.in:8500
      tag: marathon-user-observability  # Used in After
      refresh_interval: 30s             # Used in After+delay
  relabel_configs:
    - source_labels: [__meta_consul_tags]
      regex: ^(.*,)?marathon-user-observability(,.*)?$
      action: keep

- job_name: "observability-by-name"
  scrape_interval: "60s"
  metrics_path: "/metrics"
  consul_sd_configs:
    - server: consul.service.par.consul.prod.crto.in:8500
      services:
        - observability-cerebro
        - observability-portal-web

- job_name: "fake-fake-fake"
  scrape_interval: "15s"
  metrics_path: "/metrics"
  consul_sd_configs:
    - server: consul.service.par.consul.prod.crto.in:8500
      services:
        - fake-fake-fake
```

Note: tested with ~1200 services, ~5000 nodes.

| Resource | Empty | Before | After | After + delay |
| -------- |:-----:|:------:|:-----:|:-------------:|
|/service-discovery size|5K|85MiB|27k|27k|
|`go_memstats_heap_objects`|100k|1M|120k|110k|
|`go_memstats_heap_alloc_bytes`|24MB|150MB|28MB|27MB|
|`rate(go_memstats_alloc_bytes_total[5m])`|0.2MB/s|28MB/s|2MB/s|0.3MB/s|
|`rate(process_cpu_seconds_total[5m])`|0.1%|15%|2%|0.01%|
|`process_open_fds`|16|*1236*|22|22|
|`rate(prometheus_sd_consul_rpc_duration_seconds_count{call="services"}[5m])`|~0|1|1|*0.03*|
|`rate(prometheus_sd_consul_rpc_duration_seconds_count{call="service"}[5m])`|0.1|*80*|0.5|0.5|
|`prometheus_target_sync_length_seconds{quantile="0.9",scrape_job="observability-by-tag"}`|N/A|200ms|0.2ms|0.2ms|
|Network bandwidth|~10kbps|~2.8Mbps|~1.6Mbps|~10kbps|

Filtering by tag using relabel_configs uses **100kiB and 23kiB/s per service per job** and quite a lot of CPU. It also sends an additional *1Mbps* of traffic to consul.
Being a little bit smarter about this reduces the overhead quite a lot.
Limiting the number of `/catalog/services` queries per second almost removes the overhead of service discovery.
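
To make the "Before" case concrete, here is a minimal Go sketch of what the `keep` regex from the benchmark configuration does to each discovered target. It assumes `__meta_consul_tags` holds the tags joined by commas (for example `,tag1,tag2,`). The function name `hasObservabilityTag` is hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// keepTag is the regex from the benchmark relabel_configs.
var keepTag = regexp.MustCompile(`^(.*,)?marathon-user-observability(,.*)?$`)

// hasObservabilityTag reports whether the comma-joined tag list contains
// the exact tag, not merely a tag that starts with it.
func hasObservabilityTag(joinedTags string) bool {
	return keepTag.MatchString(joinedTags)
}

func main() {
	for _, tags := range []string{
		",marathon-user-observability,http,",
		",http,marathon-user-observability-staging,",
		"",
	} {
		fmt.Printf("%q -> %v\n", tags, hasObservabilityTag(tags))
	}
}
```

With relabeling, this check runs client-side after every service has already been fetched, which is why moving the tag filter into the Consul query removes most of the overhead.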

* consul: tweak `refresh_interval` behavior

`refresh_interval` now does what is advertised in the documentation:
there won't be more than one update per `refresh_interval`. It now
defaults to 30s (which was also the current waitTime in the consul query).

This also makes sure we don't wait another 30s if we already waited 29s
in the blocking call by subtracting the number of elapsed seconds.

Hopefully this will do what people expect and will be safer
for existing consul infrastructures.
2018-03-23 14:48:43 +00:00
Yecheng Fu 56ed29fbf7 Map target infos of endpoints to prometheus meta labels. (#3770) 2018-03-09 10:07:00 +00:00
Fabian Reinartz 3e6c890aea api: add flag to skip head on snapshots 2018-03-08 13:07:12 +01:00
Jeffrey Zhang 21f96caab3 Fix wrong syntax for alert field templates (#3883) 2018-02-24 09:37:43 +00:00
Conor Broderick 99006d3baf Added dropped targets API to targets endpoint (#3870) 2018-02-21 17:26:18 +00:00
Conor Broderick 1fd20fc954 Add dropped alertmanagers to alertmanagers API (#3865) 2018-02-21 09:00:07 +00:00
Bartek Plotka 93a63ac5fd api: Added v1/status/flags endpoint. (#3864)
Endpoint URL: /api/v1/status/flags
Example Output:
```json
{
  "status": "success",
  "data": {
    "alertmanager.notification-queue-capacity": "10000",
    "alertmanager.timeout": "10s",
    "completion-bash": "false",
    "completion-script-bash": "false",
    "completion-script-zsh": "false",
    "config.file": "my_cool_prometheus.yaml",
    "help": "false",
    "help-long": "false",
    "help-man": "false",
    "log.level": "info",
    "query.lookback-delta": "5m",
    "query.max-concurrency": "20",
    "query.timeout": "2m",
    "storage.tsdb.max-block-duration": "36h",
    "storage.tsdb.min-block-duration": "2h",
    "storage.tsdb.no-lockfile": "false",
    "storage.tsdb.path": "data/",
    "storage.tsdb.retention": "15d",
    "version": "false",
    "web.console.libraries": "console_libraries",
    "web.console.templates": "consoles",
    "web.enable-admin-api": "false",
    "web.enable-lifecycle": "false",
    "web.external-url": "",
    "web.listen-address": "0.0.0.0:9090",
    "web.max-connections": "512",
    "web.read-timeout": "5m",
    "web.route-prefix": "/",
    "web.user-assets": ""
  }
}
```
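
A client could decode this response along the following lines. This is a minimal sketch that assumes only the shape shown in the example output (a `status` string and a `data` object of string values). The type `flagsResponse` and the function `parseFlags` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// flagsResponse mirrors the example output: every flag value is a string.
type flagsResponse struct {
	Status string            `json:"status"`
	Data   map[string]string `json:"data"`
}

// parseFlags decodes a response body and returns the flag map.
func parseFlags(body []byte) (map[string]string, error) {
	var resp flagsResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	if resp.Status != "success" {
		return nil, fmt.Errorf("unexpected status %q", resp.Status)
	}
	return resp.Data, nil
}

func main() {
	body := []byte(`{"status":"success","data":{"log.level":"info","storage.tsdb.retention":"15d"}}`)
	flags, err := parseFlags(body)
	if err != nil {
		panic(err)
	}
	fmt.Println(flags["storage.tsdb.retention"])
}
```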

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>
2018-02-21 08:49:02 +00:00
Pedro Araújo 575f665944 Add OS type meta label to Azure SD (#3863)
There is currently no way to differentiate Windows instances from Linux
ones. This is needed when you have a mix of node_exporters /
wmi_exporters for OS-level metrics and you want to have them in separate
scrape jobs.

This change allows you to do just that. Example:

```
  - job_name: 'node'
    azure_sd_configs:
      - <azure_sd_config>
    relabel_configs:
      - source_labels: [__meta_azure_machine_os_type]
        regex: Linux
        action: keep
```
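
The effect of the `keep` rule above can be sketched in Go as follows. The sketch assumes that relabel regexes are matched against the whole label value (fully anchored). The `target` type and the `keep` function are hypothetical and only model the filtering step.

```go
package main

import (
	"fmt"
	"regexp"
)

// target is a hypothetical stand-in for a discovered target's label set.
type target map[string]string

// keep returns the targets whose label value fully matches pattern.
// Targets without the label are treated as having an empty value.
func keep(targets []target, label, pattern string) []target {
	re := regexp.MustCompile("^(?:" + pattern + ")$")
	var kept []target
	for _, t := range targets {
		if re.MatchString(t[label]) {
			kept = append(kept, t)
		}
	}
	return kept
}

func main() {
	targets := []target{
		{"__address__": "10.0.0.1:9100", "__meta_azure_machine_os_type": "Linux"},
		{"__address__": "10.0.0.2:9182", "__meta_azure_machine_os_type": "Windows"},
	}
	for _, t := range keep(targets, "__meta_azure_machine_os_type", "Linux") {
		fmt.Println(t["__address__"])
	}
}
```

A second job with `regex: Windows` would pick up the remaining instances for wmi_exporter.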

The way the vendor'd AzureSDK provides to get the OsType is a bit
awkward - as far as I can tell, this information can only be gotten from
the startup disk. Newer versions of the SDK appear to improve this a
bit (by having OS information in the InstanceView), but the current way
still works.
2018-02-19 15:40:57 +00:00
Andrea Giardini 3a9637fa3c docs: Fix remote_read/remote_timeout default (#3829) 2018-02-12 12:52:33 +00:00
Brian Brazil 66b8bdbf4a Fix docs for #3820 (#3823) 2018-02-11 23:35:08 +00:00
Ben Kochie 40acc632bb Merge pull request #3505 from rdemachkovych/ansible_prom2.0
Added to documentation Ansible roles for Prometheus 2.0
2018-01-26 11:30:15 +01:00
Roman Demachkovych 8bfc611616 Remove not maintained roles 2018-01-26 09:46:44 +01:00
zemek 8a01a0fbed Set consul server default to localhost:8500 (#3703) 2018-01-24 12:14:32 +00:00
James Turnbull 00f4821178 Added missing ingress from role list (#3666) 2018-01-08 21:23:01 +00:00
James Turnbull 380cacd3a4 Readability edits to vector matching (#3624)
* Added L3 headings - makes page a little easier to read

* Made use of right-hand and left-hand consistent
2017-12-26 10:28:39 +00:00
Brian Brazil fba80da635 Fix default of read_recent to be false. (#3617)
This is what is documented in the migration guide, and the default settings
should make sense for a true long term storage.

Document the setting.
2017-12-23 17:21:38 +00:00
James Turnbull c3f9238756 Updated alert templating docs (#3596)
The docs suggest that alert templating only works in the summary and
description annotation fields. Some testing and a review of the code
suggests this is no longer true and that you can template any
annotation field.
2017-12-19 08:04:06 +00:00
Brian Brazil 9083d41d3a Add 2.0 stability guarantees (#3484)
As discussed, generally consider SDs as unstable, as realistically they
are never going to be stable. Drop the words "experimental/beta" from most
places in the docs, as users are getting the wrong impression from this.
2017-12-14 12:54:32 +00:00
Simon Pasquier aa25dff1ea Update the openstack_sd_config section
openstack_sd_config requires a 'role' parameter which wasn't documented.
2017-12-14 12:20:28 +00:00
Krasi Georgiev 08ee713c82 example to show the difference between "sum by" and "sum without" (#3558) 2017-12-14 12:20:28 +00:00
vthriller b4bd91958a [minor] docs: recording_rules: fix missing key 2017-12-14 12:20:28 +00:00
Tobias Schmidt 28205f5ca9 Remove wrong statement about alertmanager URL configuration 2017-12-14 12:20:28 +00:00
Mike Rostermund 4648f4c156 New server uses read protocol, to eh, read. (#3444) 2017-12-14 12:20:28 +00:00
Brian Brazil e0711c2e9b Document consul sd tls_config (#3440)
Fixes https://github.com/prometheus/docs/issues/681
2017-12-14 12:20:28 +00:00
Tom Wilkie d2f6803d14 'Prometheus lifecycle' should be a subsection of 'Miscellaneous' 2017-12-14 12:20:28 +00:00
Or Elimelech 6e8d192ba0 Wrong URL for remote.proto (#3431)
Change wrong URL for remote.proto
2017-12-14 12:20:28 +00:00
phyber 013dc30dee Fix markdown in recording rules. (#3432)
Resolves an issue where rendered markdown was incorrect.
2017-12-14 12:20:28 +00:00
Tobias Schmidt 87f5fe3576 Fix migration documentation title in docs menu 2017-12-14 12:20:28 +00:00
Brian Brazil 5dff97639f Tweak migration doc (#3430) 2017-12-14 12:20:28 +00:00
Jose Donizetti b3b6538348 Small changes to migration guide 2017-12-14 12:20:28 +00:00
Goutham Veeramachaneni bee6864c14 Make the date returned by snapshot script friendly
Fixes #3568

Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-12-10 15:14:31 -06:00
Goutham Veeramachaneni e0d917e2f5 Merge pull request #3523 from Gouthamve/clean-tomb
Add endpoint to cleanup tombstones
2017-12-07 14:39:24 -06:00
Goutham Veeramachaneni f0599d4dbf Incorporate review-feedback
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-12-07 09:06:04 -06:00
James Turnbull 330735aca6 Added another full link to the configuration docs (#3553) 2017-12-07 08:31:15 +00:00
Amy Holt 607a675617 Add prefix to relative 3 URLs (#3551) 2017-12-06 21:16:53 +00:00
Goutham Veeramachaneni 311edc5a38 Merge branch 'master' into clean-tomb
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-12-05 10:23:21 -06:00
Goutham Veeramachaneni d8515b2580 Move Admin APIs to v1
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-12-04 00:13:43 +05:30
Goutham Veeramachaneni 41b8f1f8fe Add admin API docs
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-12-02 15:37:31 +05:30
Matthias Rampke cae4538b3e Docs: state that all regular expressions are RE2. (#3518)
We already mentioned that regular expressions are RE2 for
[relabeling][0], but left open what the regular expression syntax
anywhere else is.

In the querying examples and reference, make it explicit that _all_
regular expressions are RE2.

[0]: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#relabel_config
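
One practical consequence of RE2 syntax is that constructs such as backreferences and lookaheads are rejected. The sketch below checks a few patterns with Go's standard `regexp` package, which accepts RE2 syntax. The helper `validRE2` is a hypothetical name and the sample patterns are illustrative only.

```go
package main

import (
	"fmt"
	"regexp"
)

// validRE2 reports whether pattern compiles under Go's RE2-based syntax.
func validRE2(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	patterns := []string{
		`(?:node|wmi)_exporter:[0-9]+`, // alternation and classes are fine
		`(a)\1`,                        // backreference: not supported
		`foo(?=bar)`,                   // lookahead: not supported
	}
	for _, p := range patterns {
		fmt.Printf("%-32s valid=%v\n", p, validRE2(p))
	}
}
```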
2017-12-01 17:26:06 +00:00
Roman Demachkovych e0ad66f5a6 fix link name 2017-11-27 18:22:27 +01:00
Roman Demachkovych 370d045f5d Change repo link 2017-11-27 18:14:12 +01:00
James Turnbull 47311bf005 Update configuration.md (#3513)
1. Removed https://prometheus.io prefix 
2. Fixed broken file discovery link.
2017-11-27 14:52:32 +00:00
Tom Wilkie 9d4e332137 Merge pull request #3495 from tomwilkie/pod-uid-discovery-master
Include Pod UID in the discovery metadata.
2017-11-24 15:37:57 +00:00
Tom Wilkie 7d4f7c4b71 Update docs for __meta_kubernetes_pod_uid 2017-11-24 15:02:53 +00:00
Roman Demachkovych 5e243bc556 fix link 2017-11-22 16:26:06 +01:00
Roman Demachkovych b758039f80 Added in to documentation Ansible roles for Prometheus 2.0 2017-11-22 16:15:46 +01:00
Ben Kochie 40f33f45cb Fix docs that use regexp anchors (#3504)
Remove/fix docs that use anchors in label regexp matches.
2017-11-22 12:11:21 +00:00
Tobias Schmidt 7098c56474 Add remote read filter option
For special remote read endpoints which have only data for specific
queries, it is desired to limit the number of queries sent to the
configured remote read endpoint to reduce latency and performance
overhead.
2017-11-13 23:30:01 +01:00
Tom Wilkie 617e7d0203 Add migration docs for 2.0 (#3374)
* Initial draft of migration.md

* Edits.

* Review feedback.

* Review feedback.

* Staleness link to video; add docker root example; remote config file section.

* s/NB/NOTE/, remove external labels link.

* More typos.

* Add more details link for removed PromQL features.

* s/you/your/

* Expand on prom1.8/2.0 side by side setup.

* More feedback.

* update links.

* --query.lookback-delta flag.
2017-11-08 08:14:33 +01:00
Julius Volz 02ca988bbd Remove /api/v1/delete_series docs for 2.0 (#3425)
This endpoint has moved to /api/v2 (with somewhat different properties)
in Prometheus 2.0 and should now be part of a separate admin API page.
2017-11-07 22:37:03 +00:00
Tobias Schmidt a117f051da Remove outdated information about next-release doc branch 2017-11-07 22:28:04 +01:00
Julius Volz ef08df0e6f Add 2.0 storage docs (#3423)
* Add 2.0 storage docs

* Review fixups

* More review fixups
2017-11-07 22:00:38 +01:00
Brian Brazil a5b7955ace Tweak marathon wording around clustering. 2017-11-02 13:03:19 +00:00
Goutham Veeramachaneni 646e33242e docs: Fix minor issues with the docs. (#3389)
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-11-01 15:35:50 +00:00
Brian Brazil b6494960d1 docs: Document new recording rule format (#3378) 2017-11-01 12:58:32 +00:00
Brian Brazil 7187771f20 Document new staleness (#3380)
Remove "interpolation" for this heading, that hasn't
been in these docs for a long time.
2017-11-01 12:40:47 +00:00
James Turnbull 3701a827cf Updates to alerting rules docs (#3381)
1. Added a further explanation of the for clause.
2. Added further clarification of non identifying labels.
2017-10-31 19:19:17 +00:00
Brian Brazil 8cf279efb1 Document new alerting rule format. 2017-10-31 14:46:34 +00:00
Brian Brazil efaa8f9ce8 Update getting started with new rules format 2017-10-31 13:58:09 +00:00
Fabian Reinartz a32e4cbdd8 docs: remove 1.x storage docs
The only section that still applies was the one on the default storage
directory so those docs seem obsolete.
We'll probably have a similar page on the new storage but we'll only
find out what caveats etc. we'll have to point out as we get people
reporting problems or notable behavior.
2017-10-28 12:11:35 +02:00
Fabian Reinartz 8cc78b36a2 docs: remove obsolete info in getting started
Go automatically configures the number of used threads appropriately
and tweaking it is no longer relevant for a basic setup of Prometheus.
The baseline consumption tied to the storage layer no longer applies.
2017-10-28 12:09:03 +02:00
Fabian Reinartz 8a2b5a3936 docs: update flags to new double-dash syntax 2017-10-28 12:08:33 +02:00
Brian Brazil faf4bb03ee Docs: timestamp() function. 2017-10-27 15:54:45 +01:00
Brian Brazil aeb524ad14 Docs: remove keep_common, count_scalar, drop_common_labels 2017-10-27 15:54:45 +01:00
Tobias Schmidt f49ae044d7 Import template reference and examples 2017-10-27 16:08:38 +02:00
Tobias Schmidt f432b8176d Consolidate configuration and rules docs in docs/configuration/ 2017-10-27 09:54:02 +02:00
Tobias Schmidt 4d30a11ab6 Import storage and federation documentation from docs 2017-10-26 22:36:47 +02:00
Tobias Schmidt e6cdc2d355 Import querying documentation from prometheus/docs 2017-10-26 22:36:47 +02:00
Tobias Schmidt 299802dfd0 Integrate changes from prometheus/docs 2017-10-26 16:14:43 +02:00
Tobias Schmidt 41281aff81 Include 1.8 changes in configuration docs 2017-10-26 16:14:43 +02:00
Tobias Schmidt 53a5f52224 Import first batch of Prometheus documentation
In order to provide documentation for each individual version, this
commit starts moving Prometheus server specific documentation into the
repository itself.
2017-10-26 16:14:43 +02:00