Commit graph

152 commits

Author SHA1 Message Date
beorn7 a4e4bec3fe Merge branch 'release-2.2' 2018-04-30 14:38:29 +02:00
Elif T. Kuş 57dcdfb15f Rewrote tests with testutil for several test files (#4086)
* promql: Rewrote tests with testutil for functions_test

Signed-off-by: Elif T. Kuş <elifkus@gmail.com>

* pkg/relabel: Rewrote tests with testutil for relabel_test

Signed-off-by: Elif T. Kuş <elifkus@gmail.com>

* discovery/consul: Rewrote tests with testutil for consul_test

Signed-off-by: Elif T. Kuş <elifkus@gmail.com>

* scrape: Rewrote tests with testutil for manager_test

Signed-off-by: Elif T. Kuş <elifkus@gmail.com>
2018-04-27 13:11:16 +01:00
Yecheng Fu 2be543e65a Simplify some code and comments.
Signed-off-by: Yecheng Fu <cofyc.jackson@gmail.com>
2018-04-25 19:29:34 +02:00
Yecheng Fu 46683dd67d Simplify code.
- Unified `send` function.
- Pass InformerSynced functions to `cache.WaitForCacheSync`.
- Use `Role\w+` constants instead of literal string.

Signed-off-by: Yecheng Fu <cofyc.jackson@gmail.com>
2018-04-25 19:29:21 +02:00
Yecheng Fu 3a253f796c Fix grammar in comments and add missing expectedMaxItems to let it
break fast.

Signed-off-by: Yecheng Fu <cofyc.jackson@gmail.com>
2018-04-25 19:29:03 +02:00
Yecheng Fu d73b0d3141 Move hasSynced interface and its implementations to *_test.go files.
Signed-off-by: Yecheng Fu <cofyc.jackson@gmail.com>
2018-04-25 19:28:49 +02:00
Yecheng Fu 8ceb8f2ae8 Refactor Kubernetes Discovery Part 2: Refactoring
- Do initial listing and syncing to scrape manager, then register event
  handlers may lost events happening in listing and syncing (if it
  lasted a long time). We should register event handlers at the very
  begining, before processing just wait until informers synced (sync in
  informer will list all objects and call OnUpdate event handler).
- Use a queue then we don't block event callbacks and an object will be
  processed only once if added multiple times before it being processed.
- Fix bug in `serviceUpdate` in endpoints.go, we should build endpoints
  when `exists && err == nil`. Add `^TestEndpointsDiscoveryWithService`
  tests to test this feature.

Testing:

- Use `k8s.io/client-go` testing framework and fake implementations which are
  more robust and reliable for testing.
- `Test\w+DiscoveryBeforeRun` are used to test objects created before
  discoverer runs
- `Test\w+DiscoveryAdd\w+` are used to test adding objects
- `Test\w+DiscoveryDelete\w+` are used to test deleting objects
- `Test\w+DiscoveryUpdate\w+` are used to test updating objects
- `TestEndpointsDiscoveryWithService\w+` are used to test endpoints
  events triggered by services
- `cache.DeletedFinalStateUnknown` related stuffs are removed, because
  we don't care deleted objects in store, we only need its name to send
  a specical `targetgroup.Group` to scrape manager

Signed-off-by: Yecheng Fu <cofyc.jackson@gmail.com>
2018-04-25 19:28:34 +02:00
Adam Shannon 809881d7f5 support reading basic_auth password_file for HTTP basic auth (#4077)
Issue: https://github.com/prometheus/prometheus/issues/4076

Signed-off-by: Adam Shannon <adamkshannon@gmail.com>
2018-04-25 18:19:06 +01:00
Rohit Gupta 30c3e02864 Fixes #4090. Marathon service discovery for 5XX http response (#4091)
Signed-off-by: rohit01 <hello@rohit.io>
2018-04-17 09:28:06 +01:00
sev3ryn cc917aee7f fix of endless loop while doing Consul service discovery. (#4044)
Reloading Prometheus configs doesn't make loop end.
It produced a goroutine leak
2018-04-05 10:41:09 +01:00
Philippe Laflamme 2aba238f31 Use common HTTPClientConfig for marathon_sd configuration (#4009)
This adds support for basic authentication which closes #3090

The support for specifying the client timeout was removed as discussed in https://github.com/prometheus/common/pull/123. Marathon was the only sd mechanism doing this and configuring the timeout is done through `Context`.

DC/OS uses a custom `Authorization` header for authenticating. This adds 2 new configuration properties to reflect this.

Existing configuration files that use the bearer token will no longer work. More work is required to make this backwards compatible.
2018-04-05 09:08:18 +01:00
Manos Fokas 25f929b772 Yaml UnmarshalStrict implementation. (#4033)
* Updated yaml vendor package.

* remove checkOverflow duplicate in rulefmt

* remove duplicated HTTPClientConfig.Validate()

* Added yaml static check.
2018-04-04 09:07:39 +01:00
albatross0 0245fd55bf Add a machine type label to GCE SD (#4032) 2018-03-31 09:20:19 +01:00
Kristiyan Nikolov be85ba3842 discovery/ec2: Support filtering instances in discovery (#4011) 2018-03-31 07:51:11 +01:00
Corentin Chary 60dafd425c consul: improve consul service discovery (#3814)
* consul: improve consul service discovery

Related to #3711

- Add the ability to filter by tag and node-meta in an efficient way (`/catalog/services`
  allow filtering by node-meta, and returns a `map[string]string` or `service`->`tags`).
  Tags and nore-meta are also used in `/catalog/service` requests.
- Do not require a call to the catalog if services are specified by name. This is important
  because on large cluster `/catalog/services` changes all the time.
- Add `allow_stale` configuration option to do stale reads. Non-stale
  reads can be costly, even more when you are doing them to a remote
  datacenter with 10k+ targets over WAN (which is common for federation).
- Add `refresh_interval` to minimize the strain on the catalog and on the
  service endpoint. This is needed because of that kind of behavior from
  consul: https://github.com/hashicorp/consul/issues/3712 and because a catalog
  on a large cluster would basically change *all* the time. No need to discover
  targets in 1sec if we scrape them every minute.
- Added plenty of unit tests.

Benchmarks
----------

```yaml
scrape_configs:

- job_name: prometheus
  scrape_interval: 60s
  static_configs:
    - targets: ["127.0.0.1:9090"]

- job_name: "observability-by-tag"
  scrape_interval: "60s"
  metrics_path: "/metrics"
  consul_sd_configs:
    - server: consul.service.par.consul.prod.crto.in:8500
      tag: marathon-user-observability  # Used in After
      refresh_interval: 30s             # Used in After+delay
  relabel_configs:
    - source_labels: [__meta_consul_tags]
      regex: ^(.*,)?marathon-user-observability(,.*)?$
      action: keep

- job_name: "observability-by-name"
  scrape_interval: "60s"
  metrics_path: "/metrics"
  consul_sd_configs:
    - server: consul.service.par.consul.prod.crto.in:8500
      services:
        - observability-cerebro
        - observability-portal-web

- job_name: "fake-fake-fake"
  scrape_interval: "15s"
  metrics_path: "/metrics"
  consul_sd_configs:
    - server: consul.service.par.consul.prod.crto.in:8500
      services:
        - fake-fake-fake
```

Note: tested with ~1200 services, ~5000 nodes.

| Resource | Empty | Before | After | After + delay |
| -------- |:-----:|:------:|:-----:|:-------------:|
|/service-discovery size|5K|85MiB|27k|27k|27k|
|`go_memstats_heap_objects`|100k|1M|120k|110k|
|`go_memstats_heap_alloc_bytes`|24MB|150MB|28MB|27MB|
|`rate(go_memstats_alloc_bytes_total[5m])`|0.2MB/s|28MB/s|2MB/s|0.3MB/s|
|`rate(process_cpu_seconds_total[5m])`|0.1%|15%|2%|0.01%|
|`process_open_fds`|16|*1236*|22|22|
|`rate(prometheus_sd_consul_rpc_duration_seconds_count{call="services"}[5m])`|~0|1|1|*0.03*|
|`rate(prometheus_sd_consul_rpc_duration_seconds_count{call="service"}[5m])`|0.1|*80*|0.5|0.5|
|`prometheus_target_sync_length_seconds{quantile="0.9",scrape_job="observability-by-tag"}`|N/A|200ms|0.2ms|0.2ms|
|Network bandwidth|~10kbps|~2.8Mbps|~1.6Mbps|~10kbps|

Filtering by tag using relabel_configs uses **100kiB and 23kiB/s per service per job** and quite a lot of CPU. Also sends and additional *1Mbps* of traffic to consul.
Being a little bit smarter about this reduces the overhead quite a lot.
Limiting the number of `/catalog/services` queries per second almost removes the overhead of service discovery.

* consul: tweak `refresh_interval` behavior

`refresh_interval` now does what is advertised in the documentation,
there won't be more that one update per `refresh_interval`. It now
defaults to 30s (which was also the current waitTime in the consul query).

This also make sure we don't wait another 30s if we already waited 29s
in the blocking call by substracting the number of elapsed seconds.

Hopefully this will do what people expect it does and will be safer
for existing consul infrastructures.
2018-03-23 14:48:43 +00:00
Ben Kochie 0d9fe18f5e Fix nil context staticcheck error. 2018-03-22 07:59:39 +00:00
Aaron Kirkbride c47fbcb626 Fix moved fsnotify dependency (#3995) 2018-03-21 15:46:31 +00:00
Jeeyoung Kim 5b962c5748 Revert "Feature: Allow getting credentials via EC2 role (#3343)" (#3985)
This reverts commit 808f79f00a.
2018-03-20 12:34:54 +00:00
Matt Palmer 042090a6d3 [dns_sd] Send an EDNS0 query by default (#3586)
Based on https://groups.google.com/d/topic/prometheus-users/02kezHbuea4/discussion

Does not attempt to handle a situation where the server does not understand
EDNS0, however that is an unlikely case, and the behaviour of such ancient
systems is hard to predict in advance, so if it does come up, it will need
to be handled on a case-by-case basis.
2018-03-09 10:21:58 +00:00
Yecheng Fu 56ed29fbf7 Map target infos of endpoints to prometheus meta labels. (#3770) 2018-03-09 10:07:00 +00:00
Marek Siarkowicz 86011047ca Validate required fields in sd configuration (#3911) 2018-03-05 19:27:54 +00:00
Krasi Georgiev 6b0e9ef183 Validate json parse for TargetGroup Unmarshal (#3614)
Using DisallowUnknownFields in golang 1.10 to forbid unknown fields in targetGroups
added the license header for the targetGroup test
2018-02-27 12:33:27 +00:00
Krasi Georgiev 4fa7e719f4 race in Triton SD Test (#3885) 2018-02-26 10:03:50 +00:00
ferhat elmas ffa673f7d8 General simplifications (#3887)
Another try as in #1516
2018-02-26 07:58:10 +00:00
Pedro Araújo 575f665944 Add OS type meta label to Azure SD (#3863)
There is currently no way to differentiate Windows instances from Linux
ones. This is needed when you have a mix of node_exporters /
wmi_exporters for OS-level metrics and you want to have them in separate
scrape jobs.

This change allows you to do just that. Example:

```
  - job_name: 'node'
    azure_sd_configs:
      - <azure_sd_config>
    relabel_configs:
      - source_labels: [__meta_azure_machine_os_type]
        regex: Linux
        action: keep
```

The way the vendor'd AzureSDK provides to get the OsType is a bit
awkward - as far as I can tell, this information can only be gotten from
the startup disk. Newer versions of the SDK appear to improve this a
bit (by having OS information in the InstanceView), but the current way
still works.
2018-02-19 15:40:57 +00:00
Simon Pasquier 2072bbc824 Send update when pod's IP address is empty
When the pod gets evicted, its IP address becomes empty and it needs to
be removed from the targets.
2018-02-14 14:23:52 +01:00
Krasi Georgiev b75428ec19 rename package retrieve to scrape
no fucnctinal changes just renaming retrieval to scrape
2018-02-01 09:55:07 +00:00
Frederic Branczyk d3ae1ac40e
Merge pull request #3741 from krasi-georgiev/discovery-race
read/write race for the  context field in the discovery package
2018-01-30 18:17:09 +01:00
pasquier-s bde64cf5a6 Fix Kubernetes endpoints SD for empty subsets (#3660)
* Fix Kubernetes endpoints SD for empty subsets

When an endpoints object has no associated pods (replica scaled to zero
for instance), the endpoints SD should return a target group with no
targets so that the SD manager propagates this information to the scrape
manager.

Fixes #3659

* Don't send nil target groups from the Kubernetes SD

This is to be consistent with the endpoints SD part.
2018-01-30 15:00:33 +00:00
Krasi Georgiev 818dda72db updated the sd tests 2018-01-29 15:19:15 +00:00
Krasi Georgiev acc4197098 remove dicovery race for the context field 2018-01-29 15:18:07 +00:00
Frederic Branczyk 47538cf6ce
Merge pull request #3747 from prometheus/sched-update-throttle
Update throttle & tsdb update
2018-01-29 16:05:05 +01:00
Frederic Branczyk 73e829137b
discovery: Cleanup ticker 2018-01-29 13:51:04 +01:00
Ganesh Vernekar 66b0aa3b45 Fixed race condition in map iteration and map write in Discovery (#3735) (#3738)
* Fixed concurrent map iteration and map write in Discovery (#3735)

* discovery: Changed Lock to RLock in Collect
2018-01-28 22:24:31 +05:30
Krasi Georgiev fe926e7829 update the discover tests
the discovery test is now only testing update and get groups.
It doesn't do an e2e test but just a unit test of setting and receiving
target groups
2018-01-27 12:03:06 +00:00
Callum Styan 7dc05538f7 docs: SD implementations do not have to only send new/changed target groups (#3713) 2018-01-26 22:03:11 +00:00
Frederic Branczyk cfa0253ed8
discovery: Schedule updates to throttle 2018-01-26 16:24:44 +01:00
zemek 8a01a0fbed Set consul server default to localhost:8500 (#3703) 2018-01-24 12:14:32 +00:00
Julius Volz 09e460a647
discovery: Rename file SD mtime metric (#3723)
- "timestamp" -> "mtime" to be in line with node exporter and clearer.
- add unit suffix
2018-01-22 14:02:24 +01:00
Krasi Georgiev ec26751fd2 use mutexes for the discovery manager instead of a loop as this was a stupid idea 2018-01-17 18:12:58 +00:00
Krasi Georgiev 767faa44b6 fixed the tests
Signed-off-by: Krasi Georgiev <krasi.root@gmail.com>
2018-01-15 13:39:47 +00:00
Krasi Georgiev d12e6f29fc discovery manager ApplyConfig now takes a direct ServiceDiscoveryConfig so that it can be used for the notify manager
reimplement the service discovery for the notify manager

Signed-off-by: Krasi Georgiev <krasi.root@gmail.com>
2018-01-15 13:39:44 +00:00
Goutham Veeramachaneni b20a1b1b1b
Merge pull request #3654 from krasi-georgiev/discovery-handle-discoverer-updates
discovery - handle Discoverers that send only target Group updates.
2018-01-15 18:53:22 +05:30
Krasi Georgiev 790cf30fcb remove uneeded check 2018-01-15 11:52:20 +00:00
Krasi Georgiev 38938ba493 comment nits 2018-01-15 11:47:36 +00:00
Krasi Georgiev febebcd49a more comments for the future ME, and reverted the Discovery manager execution changes as these were correct in the first place 2018-01-12 22:07:21 +00:00
Krasi Georgiev 78ba5e62a6 few mote usefull comments 2018-01-12 13:58:23 +00:00
Krasi Georgiev cabce21b70 delete empty targets sets to avoid memory leaks 2018-01-12 13:10:59 +00:00
Krasi Georgiev abfd9f1920 nits 2018-01-12 12:19:52 +00:00
Shubheksha Jalan 0471e64ad1 Use shared types from the common repo (#3674)
* refactor: use shared types from common repo, remove util/config

* vendor: add common/config

* fix nit
2018-01-11 16:10:25 +01:00