Commit graph

183 commits

Author SHA1 Message Date
Fabian Reinartz b04ab71268
Merge pull request #4488 from jkohen/patch-3
Populate __meta_gce_instance_id discovery label
2018-08-11 09:52:28 +02:00
Javier Kohen 403ac08ece Expose __meta_gce_instance_id as an integer (instead of raw bytes).
Signed-off-by: Javier Kohen <jkohen@google.com>
2018-08-10 16:21:46 -04:00
Javier Kohen 7e9549b398 Added __meta_gce_instance_id discovery label
Populated from instance.ID. I will follow up with a change to the documentation.

Signed-off-by: Javier Kohen <jkohen@google.com>
2018-08-10 11:57:55 -04:00
Simon Pasquier b7054f3a78
Merge pull request #4443 from simonpasquier/fix-consul-connections-leak
discovery/consul: close idle connections on stop
2018-08-10 17:43:39 +02:00
Benji Visser 46fb4078a6 handle nil pointer in ec2 discovery (#4469)
This handles a nil pointer that was being accessed in EC2 discovery.

Fixes: #4441

Signed-off-by: noqcks <benny@noqcks.io>
2018-08-07 08:35:22 +01:00
Johannes Scheuermann f978f5bba3 Fixes #4202, correctly parse VMs with empty tags (#4450)
Signed-off-by: Johannes M. Scheuermann <joh.scheuer@gmail.com>
2018-08-02 10:10:17 +01:00
jojohappy e060f7755f Keep the comment on NodeLegacyHostIP for k8s node address
Signed-off-by: jojohappy <sarahdj0917@gmail.com>
2018-08-02 10:25:28 +08:00
jojohappy e81785d1a3 Keep the deprecated k8s node address type NodeLegacyHostIP as a local constant for compatibility with older k8s versions
Signed-off-by: jojohappy <sarahdj0917@gmail.com>
2018-08-02 10:25:28 +08:00
jojohappy 21e50a3f9d Upgrade k8s client to kubernetes-1.11.0
Signed-off-by: jojohappy <sarahdj0917@gmail.com>
2018-08-02 10:25:27 +08:00
Simon Pasquier 1cd29f782c discovery/consul: close idle connections on stop
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-08-01 17:26:52 +02:00
Johannes Scheuermann 7608ee87d0 Initial support for Azure VMSS (#4202)
* Initial support for Azure VMSS

Signed-off-by: Johannes Scheuermann <johannes.scheuermann@inovex.de>

* Add documentation for the newly introduced label

Signed-off-by: Johannes M. Scheuermann <joh.scheuer@gmail.com>
2018-08-01 12:52:21 +01:00
José Martínez 791c13b142 discovery/ec2: Add primary_subnet_id label
Signed-off-by: José Martínez <xosemp@gmail.com>
2018-07-25 09:20:58 +01:00
José Martínez 5e4a33c890 discovery/ec2: Maintain order of subnet_id label
Signed-off-by: José Martínez <xosemp@gmail.com>
2018-07-25 09:20:58 +01:00
Jannick Fahlbusch 0be25f92e2 EC2 Discovery: Allow to set a custom endpoint (#4333)
Allowing a custom endpoint makes it easy to monitor targets on non-AWS providers with EC2-compliant APIs.

Signed-off-by: Jannick Fahlbusch <git@jf-projects.de>
2018-07-18 10:48:14 +01:00
Ivan Voronchihin 59d214d277 Update autorest vendoring (#4147)
Signed-off-by: bege13mot <bege13mot@gmail.com>
2018-07-18 05:24:15 +01:00
Julius Volz 219e477272 Fix some (valid) lint errors (#4287)
Signed-off-by: Julius Volz <julius.volz@gmail.com>
2018-07-18 05:07:33 +01:00
Romain Baugue b41be4ef52 Discovery consul service meta (#4280)
* Upgrade Consul client
* Add ServiceMeta to the labels in ConsulSD

Signed-off-by: Romain Baugue <romain.baugue@elwinar.com>
2018-07-18 05:06:56 +01:00
Simon Pasquier f32acc0b7b discovery/openstack: remove unneeded assignment
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-07-15 12:37:57 +01:00
Julius Volz 05d6d6a2e5
k8s SD: Fix "schema" -> "scheme" typo (#4371)
Signed-off-by: Julius Volz <julius.volz@gmail.com>
2018-07-12 16:12:32 +02:00
Krasi Georgiev a155b6d29d Fix the ZooKeeper race (#4355)
Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>
2018-07-06 08:39:38 +01:00
Dmitry Bashkatov 72327d98fb discovery/kubernetes/ingress: remove unnecessary check
Signed-off-by: Dmitry Bashkatov <dbashkatov@gmail.com>
2018-07-04 15:47:11 +03:00
Dmitry Bashkatov e2baf89eac discovery/kubernetes/ingress: fix scheme discovery (Closes #4327)
Signed-off-by: Dmitry Bashkatov <dbashkatov@gmail.com>
2018-07-04 13:28:44 +03:00
Dmitry Bashkatov 9cdca50bdd discovery/kubernetes/ingress: add more tests
Signed-off-by: Dmitry Bashkatov <dbashkatov@gmail.com>
2018-07-04 13:28:44 +03:00
Julius Volz 5cf0113762
Add "omitempty" to some SD config YAML field tags (#4338)
Especially for Kubernetes SD, this fixes a bug where the rendered
configuration says "api_server: null", which when read back is not
interpreted as an un-set API server (thus the default is not applied).
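
The rendering half of this bug can be sketched with the standard library. This is only an analogy, under stated assumptions: it uses `encoding/json` rather than the YAML package the commit touches, and the struct names `withoutOmit` and `withOmit` are hypothetical. It shows how an unset pointer field is rendered as `null` unless the tag carries `omitempty`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// withoutOmit is a hypothetical config struct whose unset pointer field
// is still rendered (as null).
type withoutOmit struct {
	APIServer *string `json:"api_server"`
}

// withOmit is the same struct with "omitempty" added to the field tag,
// so an unset field is dropped from the rendered output.
type withOmit struct {
	APIServer *string `json:"api_server,omitempty"`
}

// render marshals v and returns the result as a string.
func render(v interface{}) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	fmt.Println(render(withoutOmit{})) // {"api_server":null}
	fmt.Println(render(withOmit{}))    // {}
}
```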

Signed-off-by: Julius Volz <julius.volz@gmail.com>
2018-07-03 13:43:41 +02:00
Simon Pasquier 6eab4bbca1 kubernetes_sd: fix namespace filtering (#4273)
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-06-15 09:08:14 +01:00
Paul Gier d24d2acd11 config: set target group source index during unmarshalling (#4245)
* config: set target group source index during unmarshalling

Fixes issue #4214 where the scrape pool is unnecessarily reloaded for a
config reload where the config hasn't changed.  Previously, the discovery
manager changed the static config after loading which caused the in-memory
config to differ from a freshly reloaded config.

Signed-off-by: Paul Gier <pgier@redhat.com>

* [issue #4214] Test that static targets are not modified by discovery manager

Signed-off-by: Paul Gier <pgier@redhat.com>
2018-06-13 16:34:59 +01:00
Simon Pasquier 0e5e7f75cd discovery/file: fix logging (#4178)
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-06-12 12:45:59 +01:00
Callum Styan 03578d5df8 add example usage of SD adapter for converting unsupported SD type to filesd (#3720)
Signed-off-by: Callum Styan <callumstyan@gmail.com>
2018-05-30 13:14:34 +01:00
Adam Shannon a22e1736b9 discovery/marathon: include url in fetchApps error (#4171)
This was previously part of a larger PR, but that was closed.

https://github.com/prometheus/prometheus/issues/4048#issuecomment-389899997

This change could include auth information in the URL. That's been
fixed in upstream go, but not until Go 1.11. See: https://github.com/golang/go/issues/24572

Signed-off-by: Adam Shannon <adamkshannon@gmail.com>
2018-05-18 10:20:14 +01:00
Damien Lespiau e64037053d Expose controller kind and name to labelling rules
Relabelling rules can use this information to attach the name of the controller
that has created a pod.

In turn, this can be used to slice metrics by workload at query time, i.e.
"Give me all metrics that have been created by the $name Deployment"

Signed-off-by: Damien Lespiau <damien@weave.works>
2018-05-09 11:51:37 +02:00
Nathan Graves 5b27996cb3 Include GCE labels during service discovery. Updated vendor files for Google API. (#4150)
Signed-off-by: Nathan Graves <nathan.graves@kofile.us>
2018-05-08 17:37:47 +01:00
beorn7 a4e4bec3fe Merge branch 'release-2.2' 2018-04-30 14:38:29 +02:00
Elif T. Kuş 57dcdfb15f Rewrote tests with testutil for several test files (#4086)
* promql: Rewrote tests with testutil for functions_test

Signed-off-by: Elif T. Kuş <elifkus@gmail.com>

* pkg/relabel: Rewrote tests with testutil for relabel_test

Signed-off-by: Elif T. Kuş <elifkus@gmail.com>

* discovery/consul: Rewrote tests with testutil for consul_test

Signed-off-by: Elif T. Kuş <elifkus@gmail.com>

* scrape: Rewrote tests with testutil for manager_test

Signed-off-by: Elif T. Kuş <elifkus@gmail.com>
2018-04-27 13:11:16 +01:00
Yecheng Fu 2be543e65a Simplify some code and comments.
Signed-off-by: Yecheng Fu <cofyc.jackson@gmail.com>
2018-04-25 19:29:34 +02:00
Yecheng Fu 46683dd67d Simplify code.
- Unified `send` function.
- Pass InformerSynced functions to `cache.WaitForCacheSync`.
- Use `Role\w+` constants instead of literal string.

Signed-off-by: Yecheng Fu <cofyc.jackson@gmail.com>
2018-04-25 19:29:21 +02:00
Yecheng Fu 3a253f796c Fix grammar in comments and add missing expectedMaxItems to let it
break fast.

Signed-off-by: Yecheng Fu <cofyc.jackson@gmail.com>
2018-04-25 19:29:03 +02:00
Yecheng Fu d73b0d3141 Move hasSynced interface and its implementations to *_test.go files.
Signed-off-by: Yecheng Fu <cofyc.jackson@gmail.com>
2018-04-25 19:28:49 +02:00
Yecheng Fu 8ceb8f2ae8 Refactor Kubernetes Discovery Part 2: Refactoring
- Doing the initial listing and syncing to the scrape manager first and
  registering event handlers afterwards may lose events that happen during
  listing and syncing (if it lasts a long time). We should register event
  handlers at the very beginning and, before processing, just wait until the
  informers have synced (the sync in an informer lists all objects and calls
  the OnUpdate event handler).
- Use a queue so that we don't block event callbacks and so that an object is
  processed only once if it is added multiple times before being processed.
- Fix a bug in `serviceUpdate` in endpoints.go: we should build endpoints
  when `exists && err == nil`. Add `^TestEndpointsDiscoveryWithService`
  tests to test this feature.
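
The queue behaviour described in the second point can be sketched as follows. This is a minimal single-goroutine illustration, not the code from this commit: `dedupQueue`, `Add` and `Get` are hypothetical names, and a real implementation would also need locking so that event callbacks and the worker can run concurrently.

```go
package main

import "fmt"

// dedupQueue is a hypothetical FIFO queue of object keys that ignores a key
// which is already waiting to be processed. It is not safe for concurrent use.
type dedupQueue struct {
	pending map[string]bool // keys currently waiting in the queue
	order   []string        // keys in arrival order
}

func newDedupQueue() *dedupQueue {
	return &dedupQueue{pending: make(map[string]bool)}
}

// Add enqueues key unless it is already waiting in the queue.
func (q *dedupQueue) Add(key string) {
	if q.pending[key] {
		return
	}
	q.pending[key] = true
	q.order = append(q.order, key)
}

// Get removes and returns the oldest key. ok is false when the queue is empty.
func (q *dedupQueue) Get() (key string, ok bool) {
	if len(q.order) == 0 {
		return "", false
	}
	key = q.order[0]
	q.order = q.order[1:]
	delete(q.pending, key)
	return key, true
}

func main() {
	q := newDedupQueue()
	// Two events for the same object arrive before the worker runs.
	q.Add("default/pod-a")
	q.Add("default/pod-a")
	q.Add("default/pod-b")

	for {
		key, ok := q.Get()
		if !ok {
			break
		}
		fmt.Println("processing", key)
	}
}
```

Because event callbacks only call `Add`, they return immediately, and the worker sees each pending key once.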

Testing:

- Use `k8s.io/client-go` testing framework and fake implementations which are
  more robust and reliable for testing.
- `Test\w+DiscoveryBeforeRun` are used to test objects created before
  discoverer runs
- `Test\w+DiscoveryAdd\w+` are used to test adding objects
- `Test\w+DiscoveryDelete\w+` are used to test deleting objects
- `Test\w+DiscoveryUpdate\w+` are used to test updating objects
- `TestEndpointsDiscoveryWithService\w+` are used to test endpoints
  events triggered by services
- `cache.DeletedFinalStateUnknown` related code is removed, because
  we don't care about deleted objects in the store; we only need the name
  to send a special `targetgroup.Group` to the scrape manager

Signed-off-by: Yecheng Fu <cofyc.jackson@gmail.com>
2018-04-25 19:28:34 +02:00
Adam Shannon 809881d7f5 support reading basic_auth password_file for HTTP basic auth (#4077)
Issue: https://github.com/prometheus/prometheus/issues/4076

Signed-off-by: Adam Shannon <adamkshannon@gmail.com>
2018-04-25 18:19:06 +01:00
Rohit Gupta 30c3e02864 Fixes #4090. Marathon service discovery for 5XX http response (#4091)
Signed-off-by: rohit01 <hello@rohit.io>
2018-04-17 09:28:06 +01:00
sev3ryn cc917aee7f fix of endless loop while doing Consul service discovery. (#4044)
Reloading the Prometheus config didn't end the loop,
which produced a goroutine leak.
2018-04-05 10:41:09 +01:00
Philippe Laflamme 2aba238f31 Use common HTTPClientConfig for marathon_sd configuration (#4009)
This adds support for basic authentication which closes #3090

The support for specifying the client timeout was removed as discussed in https://github.com/prometheus/common/pull/123. Marathon was the only sd mechanism doing this and configuring the timeout is done through `Context`.

DC/OS uses a custom `Authorization` header for authenticating. This adds 2 new configuration properties to reflect this.

Existing configuration files that use the bearer token will no longer work. More work is required to make this backwards compatible.
2018-04-05 09:08:18 +01:00
Manos Fokas 25f929b772 Yaml UnmarshalStrict implementation. (#4033)
* Updated yaml vendor package.

* remove checkOverflow duplicate in rulefmt

* remove duplicated HTTPClientConfig.Validate()

* Added yaml static check.
2018-04-04 09:07:39 +01:00
albatross0 0245fd55bf Add a machine type label to GCE SD (#4032) 2018-03-31 09:20:19 +01:00
Kristiyan Nikolov be85ba3842 discovery/ec2: Support filtering instances in discovery (#4011) 2018-03-31 07:51:11 +01:00
Corentin Chary 60dafd425c consul: improve consul service discovery (#3814)
* consul: improve consul service discovery

Related to #3711

- Add the ability to filter by tag and node-meta in an efficient way (`/catalog/services`
  allows filtering by node-meta, and returns a `map[string]string` or `service`->`tags`).
  Tags and node-meta are also used in `/catalog/service` requests.
- Do not require a call to the catalog if services are specified by name. This is important
  because on a large cluster `/catalog/services` changes all the time.
- Add `allow_stale` configuration option to do stale reads. Non-stale
  reads can be costly, even more when you are doing them to a remote
  datacenter with 10k+ targets over WAN (which is common for federation).
- Add `refresh_interval` to minimize the strain on the catalog and on the
  service endpoint. This is needed because of that kind of behavior from
  consul: https://github.com/hashicorp/consul/issues/3712 and because a catalog
  on a large cluster would basically change *all* the time. No need to discover
  targets in 1sec if we scrape them every minute.
- Added plenty of unit tests.

Benchmarks
----------

```yaml
scrape_configs:

- job_name: prometheus
  scrape_interval: 60s
  static_configs:
    - targets: ["127.0.0.1:9090"]

- job_name: "observability-by-tag"
  scrape_interval: "60s"
  metrics_path: "/metrics"
  consul_sd_configs:
    - server: consul.service.par.consul.prod.crto.in:8500
      tag: marathon-user-observability  # Used in After
      refresh_interval: 30s             # Used in After+delay
  relabel_configs:
    - source_labels: [__meta_consul_tags]
      regex: ^(.*,)?marathon-user-observability(,.*)?$
      action: keep

- job_name: "observability-by-name"
  scrape_interval: "60s"
  metrics_path: "/metrics"
  consul_sd_configs:
    - server: consul.service.par.consul.prod.crto.in:8500
      services:
        - observability-cerebro
        - observability-portal-web

- job_name: "fake-fake-fake"
  scrape_interval: "15s"
  metrics_path: "/metrics"
  consul_sd_configs:
    - server: consul.service.par.consul.prod.crto.in:8500
      services:
        - fake-fake-fake
```

Note: tested with ~1200 services, ~5000 nodes.

| Resource | Empty | Before | After | After + delay |
| -------- |:-----:|:------:|:-----:|:-------------:|
|/service-discovery size|5K|85MiB|27k|27k|
|`go_memstats_heap_objects`|100k|1M|120k|110k|
|`go_memstats_heap_alloc_bytes`|24MB|150MB|28MB|27MB|
|`rate(go_memstats_alloc_bytes_total[5m])`|0.2MB/s|28MB/s|2MB/s|0.3MB/s|
|`rate(process_cpu_seconds_total[5m])`|0.1%|15%|2%|0.01%|
|`process_open_fds`|16|*1236*|22|22|
|`rate(prometheus_sd_consul_rpc_duration_seconds_count{call="services"}[5m])`|~0|1|1|*0.03*|
|`rate(prometheus_sd_consul_rpc_duration_seconds_count{call="service"}[5m])`|0.1|*80*|0.5|0.5|
|`prometheus_target_sync_length_seconds{quantile="0.9",scrape_job="observability-by-tag"}`|N/A|200ms|0.2ms|0.2ms|
|Network bandwidth|~10kbps|~2.8Mbps|~1.6Mbps|~10kbps|

Filtering by tag using relabel_configs uses **100kiB and 23kiB/s per service per job** and quite a lot of CPU. It also sends an additional *1Mbps* of traffic to consul.
Being a little bit smarter about this reduces the overhead quite a lot.
Limiting the number of `/catalog/services` queries per second almost removes the overhead of service discovery.
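
For reference, the `keep` rule used in the "Before" configuration can be tried outside Prometheus. The sketch below only exercises the regular expression from the config above with Go's `regexp` package; the helper name `keep` and the sample tag strings are made up for illustration, and the tags are assumed to arrive as one comma-joined string.

```go
package main

import (
	"fmt"
	"regexp"
)

// tagFilter is the regex from the relabel_configs section above.
var tagFilter = regexp.MustCompile(`^(.*,)?marathon-user-observability(,.*)?$`)

// keep reports whether a comma-joined tag string contains the exact tag.
func keep(tags string) bool {
	return tagFilter.MatchString(tags)
}

func main() {
	samples := []string{
		",http,marathon-user-observability,prod,", // tag present
		"marathon-user-observability",             // tag is the only one
		",marathon-user-observability-v2,",        // different, longer tag
		",http,prod,",                             // tag absent
	}
	for _, tags := range samples {
		fmt.Printf("%q keep=%v\n", tags, keep(tags))
	}
}
```

Every discovered service still has to be fetched before this filter runs, which is the overhead the `tag` option avoids.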

* consul: tweak `refresh_interval` behavior

`refresh_interval` now does what is advertised in the documentation:
there won't be more than one update per `refresh_interval`. It now
defaults to 30s (which was also the current waitTime in the consul query).

This also makes sure we don't wait another 30s if we already waited 29s
in the blocking call, by subtracting the number of elapsed seconds.

Hopefully this will do what people expect it to do and will be safer
for existing consul infrastructures.
2018-03-23 14:48:43 +00:00
Ben Kochie 0d9fe18f5e Fix nil context staticcheck error. 2018-03-22 07:59:39 +00:00
Aaron Kirkbride c47fbcb626 Fix moved fsnotify dependency (#3995) 2018-03-21 15:46:31 +00:00
Jeeyoung Kim 5b962c5748 Revert "Feature: Allow getting credentials via EC2 role (#3343)" (#3985)
This reverts commit 808f79f00a.
2018-03-20 12:34:54 +00:00
Matt Palmer 042090a6d3 [dns_sd] Send an EDNS0 query by default (#3586)
Based on https://groups.google.com/d/topic/prometheus-users/02kezHbuea4/discussion

Does not attempt to handle a situation where the server does not understand
EDNS0, however that is an unlikely case, and the behaviour of such ancient
systems is hard to predict in advance, so if it does come up, it will need
to be handled on a case-by-case basis.
2018-03-09 10:21:58 +00:00