Commit graph

321 commits

Author SHA1 Message Date
Simon Pasquier a30348f1a4 discovery: add config label to discovered targets metric (#4753)
* discovery: add labels to discovered targets metric

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-10-18 16:46:59 +01:00
Simon Pasquier 5824d6902d
openstack: fix client when using env variables (#4734)
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-10-17 16:04:07 +02:00
Kien Nguyen-Tuan 9c5370fdfe Support discover instances from all projects (#4682)
By default, OpenStack SD only queries for instances
from specified project. To discover instances from other
projects, users have to add more openstack_sd_configs for
each project.

This patch adds `all_tenants` <bool> options to
openstack_sd_configs. For example:

- job_name: 'openstack_all_instances'
  openstack_sd_configs:
    - role: instance
      region: RegionOne
      identity_endpoint: http://<identity_server>/identity/v3
      username: <username>
      password: <super_secret_password>
      domain_name: Default
      all_tenants: true

Co-authored-by: Kien Nguyen <kiennt2609@gmail.com>
Signed-off-by: dmatosl <danielmatos.lima@gmail.com>
2018-10-17 13:01:33 +01:00
Simon Pasquier c4a6acfb1e
*: move to go 1.11 (#4626)
* *: move to go 1.11

Signed-off-by: Simon Pasquier <spasquie@redhat.com>

* Reduce number of places where we specify the Go version

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-10-16 09:41:45 +02:00
Goutham Veeramachaneni ffb7f829ec
Merge pull request #4730 from prometheus/release-2.4
Release 2.4
2018-10-12 14:15:42 -07:00
Simon Pasquier 3e6b9d43c3
Merge pull request #4720 from teresy/redundant-nil-check-slice
Remove redundant nil check
2018-10-11 10:24:55 +02:00
Rijnard van Tonder 9d102e3bff The nil check before the range loop is redundant
Signed-off-by: Rijnard van Tonder <hi.teresy@gmail.com>
2018-10-10 16:11:45 -04:00
Richard Kiene b537f6047a Add ability to filter triton_sd targets by pre-defined groups (#4701)
Additionally, add triton groups metadata to the discovery response
and correct a documentation error regarding the triton server id
metadata.

Signed-off-by: Richard Kiene <richard.kiene@joyent.com>
2018-10-10 10:03:34 +01:00
Simon Pasquier a2a78d0a09 discovery/openstack: discover all interfaces (#4649)
* discovery/openstack: discover all interfaces
* Add address pool label

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-10-09 16:17:08 +01:00
Simon Pasquier e1e2821cca
Merge pull request #4654 from simonpasquier/openstack-tls
discovery/openstack: support tls_config
2018-10-05 18:11:55 +02:00
Jannick Fahlbusch f78e59577b [FIX] EC2 DS: Check for existence of OwnerID (#4672)
Commit 1c89984 introduced the ability to expose the owner of the instance.
However, this breaks Prometheus if there is no OwnerID in the reservation (e.g. if you are using a private EC2 API via the custom endpoint introduced by #4333)

Signed-off-by: Jannick Fahlbusch <git@jf-projects.de>
2018-10-02 16:18:31 +05:30
Simon Pasquier 657199af22 Address Krasi comments
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-09-28 12:29:24 +02:00
Simon Pasquier 5df757fdd4 zookeeper: fix panic
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-09-28 11:39:40 +02:00
Simon Pasquier 365931ea83 discovery: add metrics + send updates from one goroutine only
The added metrics are:

* prometheus_sd_discovered_targets
* prometheus_sd_received_updates_total
* prometheus_sd_updates_delayed_total
* prometheus_sd_updates_total

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-09-27 15:59:42 +02:00
Simon Pasquier f2d43af820
Merge pull request #4582 from simonpasquier/add-discovery-tests
discovery: add more tests
2018-09-27 15:18:42 +02:00
Simon Pasquier ff08c40091 discovery/openstack: support tls_config
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-09-25 14:31:32 +02:00
Frederic Branczyk b75ec7e6ef
Merge pull request #4458 from FUSAKLA/k8s-sd-add-metrics
feat: added more k8s SD metrics
2018-09-21 13:10:48 +02:00
Timo Beckers 1c9fbd65c4 marathon-sd - change port gathering strategy, support for container networking (#4499)
* marathon-sd - change port gathering strategy, add support for container networking

- removed unnecessary error check on HTTPClientConfig.Validate()
- renamed PortDefinitions and PortMappings to PortDefinition and PortMapping respectively
- extended data model for extra parsed fields from Marathon json
- support container networking on Marathon 1.5+ (target Task.IPAddresses.x.Address)
- expanded test suite to cover all new cases
- test: cancel context when reading from doneCh before returning from function
- test: split test suite into Ports/PortMappings/PortDefinitions

Signed-off-by: Timo Beckers <timo@incline.eu>
2018-09-21 11:53:04 +01:00
Martin Chodur f2d037133e
feat: added more k8s SD metrics
Signed-off-by: Martin Chodur <m.chodur@seznam.cz>
2018-09-20 22:28:51 +02:00
Camille Janicki b035ea0ea9 Change discovery subpackages to not use testify in tests (#4612)
* Change discovery subpackages to not use testify in tests

Signed-off-by: Camille Janicki <camille.janicki@gmail.com>

* Remove testify suite from vendor dir

Signed-off-by: Camille Janicki <camille.janicki@gmail.com>
2018-09-18 17:35:22 +02:00
Simon Pasquier 128ff546b8 config: add test for OpenStack SD (#4594)
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-09-13 21:44:27 +05:30
Tom Wilkie e3d36f4802 Don't import testing from non-test code. (#4595)
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2018-09-13 16:03:26 +05:30
Bryan Boreham 968f657eaa Stop removing the final dot from rooted DNS names (#4586)
Removing a final dot changes the meaning of the name and can cause
extra DNS lookups as the resolver traverses its search path.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2018-09-13 15:28:38 +05:30
Simon Pasquier e7cee1b5ba Remove tests redundant with TestTargetUpdatesOrder
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-09-12 17:56:53 +02:00
Simon Pasquier 7dc3f11306 WIP discovery: refactor TestTargetUpdatesOrder
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-09-12 16:15:03 +02:00
Simon Pasquier 8fd891bf3f Speed up tests that were still using the 5s timeout
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-09-12 16:13:15 +02:00
Simon Pasquier 8289501420 Address krasi's comments
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-09-12 16:13:15 +02:00
Simon Pasquier 1cee5b5b06 Don't multiply the interval value by 1ms in the mock
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-09-12 16:13:15 +02:00
Simon Pasquier 4900405d2f Refactor TestCoordinationWithReceiver() to work with any Discoverer
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-09-12 16:13:15 +02:00
Simon Pasquier 0798f14e02 Add TestCoordinationWithEmptyProvider
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-09-12 16:13:15 +02:00
Simon Pasquier 48989d8996 discovery: add more tests
Co-authored-by: Camille Janicki <camille.janicki@gmail.com>
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-09-12 16:13:15 +02:00
Krasi Georgiev ba7eb733e8 tidy up the discovery logs,updating loops and selects (#4556)
* tidy up the discovery logs,updating loops and selects

few objects renamings

removed a very noisy debug log in the k8s discovery. It would be useful
to show a summary rather than every update, as the per-update output is
impossible to follow.

added most comments as debug logs so each block becomes self
explanatory.

when the discovery receiving channel is full, the send is retried on the
next cycle.

Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>

* add noop logger for the SD manager tests.

Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>

* spelling nits

Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>
2018-09-05 17:02:47 +05:30
Tariq Ibrahim f708fd5c99 Adding support for multiple azure environments (#4569)
Signed-off-by: Tariq Ibrahim <tariq.ibrahim@microsoft.com>
2018-09-04 17:55:40 +02:00
Simon Pasquier 674c76adb8 discovery: coalesce identical SD configurations (#3912)
* discovery: coalesce identical SD configurations

Instead of creating as many SD providers as declared in the
configuration, the discovery manager merges identical configurations
into the same provider and keeps track of the subscribers. When
the manager receives target updates from a SD provider, it will
broadcast the updates to all interested subscribers.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-09-01 08:51:31 +01:00
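The coalescing described in the commit above can be illustrated with a minimal sketch. It assumes an SD configuration can be compared through a serialized string key; the `job`, `provider`, and `coalesce` names are hypothetical and are not the discovery manager's actual types.

```go
package main

import "fmt"

// job is a hypothetical scrape job with its serialized SD configuration.
type job struct {
	name   string
	config string
}

// provider is a hypothetical stand-in for one running SD provider.
type provider struct {
	config string   // the shared SD configuration
	subs   []string // names of the jobs subscribed to this provider
}

// coalesce creates one provider per distinct configuration and records
// every job that uses that configuration as a subscriber.
func coalesce(jobs []job) []*provider {
	byConfig := map[string]*provider{}
	var providers []*provider
	for _, j := range jobs {
		p, ok := byConfig[j.config]
		if !ok {
			p = &provider{config: j.config}
			byConfig[j.config] = p
			providers = append(providers, p)
		}
		p.subs = append(p.subs, j.name)
	}
	return providers
}

// broadcast returns the update that each subscriber of p would receive.
func broadcast(p *provider, targets []string) map[string][]string {
	out := make(map[string][]string, len(p.subs))
	for _, s := range p.subs {
		out[s] = targets
	}
	return out
}

func main() {
	providers := coalesce([]job{
		{"a", "dns:example.org"},
		{"b", "dns:example.org"},
		{"c", "file:targets.json"},
	})
	fmt.Println(len(providers)) // two providers for three jobs
	fmt.Println(len(broadcast(providers[0], []string{"10.0.0.1:9100"})))
}
```

Keying on the configuration means two jobs with the same SD settings share one provider, while the subscriber list keeps track of who must receive each update.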
Krasi Georgiev 53691ae261 Simplify SD update throttling (#4523)
* simplified SD updates throttling

Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>

* add default to catch cases when we don't have new updates.

Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>
2018-08-27 17:12:11 +02:00
Fabian Reinartz f571b69010
Merge pull request #4514 from jkohen/ec2-targets
Expose EC2 instance owner as a discovery label.
2018-08-20 08:43:44 +02:00
Javier Kohen 1c89984778 Expose EC2 instance owner as a discovery label.
This exposes the OwnerID field of the DescribeInstances response as .

Signed-off-by: Javier Kohen <jkohen@google.com>
2018-08-17 11:30:18 -04:00
Yecheng Fu d4eae8cc0c Wait until all internal discoveries are done before exiting. (#4508)
Signed-off-by: Yecheng Fu <cofyc.jackson@gmail.com>
2018-08-17 18:50:22 +05:30
Fabian Reinartz b04ab71268
Merge pull request #4488 from jkohen/patch-3
Populate __meta_gce_instance_id discovery label
2018-08-11 09:52:28 +02:00
Javier Kohen 403ac08ece Expose __meta_gce_instance_id as an integer (instead of raw bytes).
Signed-off-by: Javier Kohen <jkohen@google.com>
2018-08-10 16:21:46 -04:00
Javier Kohen 7e9549b398 Added __meta_gce_instance_id discovery label
Populated from instance.ID. I will follow up with a change to the documentation.

Signed-off-by: Javier Kohen <jkohen@google.com>
2018-08-10 11:57:55 -04:00
Simon Pasquier b7054f3a78
Merge pull request #4443 from simonpasquier/fix-consul-connections-leak
discovery/consul: close idle connections on stop
2018-08-10 17:43:39 +02:00
Benji Visser 46fb4078a6 handle nil pointer in ec2 discovery (#4469)
This handles a nil pointer that was being accessed in EC2 discovery.

Fixes: #4441

Signed-off-by: noqcks <benny@noqcks.io>
2018-08-07 08:35:22 +01:00
Johannes Scheuermann f978f5bba3 Fixes #4202, correctly parse VMs with empty tags (#4450)
Signed-off-by: Johannes M. Scheuermann <joh.scheuer@gmail.com>
2018-08-02 10:10:17 +01:00
jojohappy e060f7755f To keep comment of NodeLegacyHostIP for k8s node address
Signed-off-by: jojohappy <sarahdj0917@gmail.com>
2018-08-02 10:25:28 +08:00
jojohappy e81785d1a3 Keep the deprecated k8s node address type NodeLegacyHostIP as a local constant for compatibility with older k8s versions
Signed-off-by: jojohappy <sarahdj0917@gmail.com>
2018-08-02 10:25:28 +08:00
jojohappy 21e50a3f9d Upgrade k8s client to kubernetes-1.11.0
Signed-off-by: jojohappy <sarahdj0917@gmail.com>
2018-08-02 10:25:27 +08:00
Simon Pasquier 1cd29f782c discovery/consul: close idle connections on stop
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-08-01 17:26:52 +02:00
Johannes Scheuermann 7608ee87d0 Initial support for Azure VMSS (#4202)
* Initial support for Azure VMSS

Signed-off-by: Johannes Scheuermann <johannes.scheuermann@inovex.de>

* Add documentation for the newly introduced label

Signed-off-by: Johannes M. Scheuermann <joh.scheuer@gmail.com>
2018-08-01 12:52:21 +01:00
José Martínez 791c13b142 discovery/ec2: Add primary_subnet_id label
Signed-off-by: José Martínez <xosemp@gmail.com>
2018-07-25 09:20:58 +01:00
José Martínez 5e4a33c890 discovery/ec2: Maintain order of subnet_id label
Signed-off-by: José Martínez <xosemp@gmail.com>
2018-07-25 09:20:58 +01:00
Jannick Fahlbusch 0be25f92e2 EC2 Discovery: Allow to set a custom endpoint (#4333)
Allowing to set a custom endpoint makes it easy to monitor targets on non AWS providers with EC2 compliant APIs.

Signed-off-by: Jannick Fahlbusch <git@jf-projects.de>
2018-07-18 10:48:14 +01:00
Ivan Voronchihin 59d214d277 Update autorest vendoring (#4147)
Signed-off-by: bege13mot <bege13mot@gmail.com>
2018-07-18 05:24:15 +01:00
Julius Volz 219e477272 Fix some (valid) lint errors (#4287)
Signed-off-by: Julius Volz <julius.volz@gmail.com>
2018-07-18 05:07:33 +01:00
Romain Baugue b41be4ef52 Discovery consul service meta (#4280)
* Upgrade Consul client
* Add ServiceMeta to the labels in ConsulSD

Signed-off-by: Romain Baugue <romain.baugue@elwinar.com>
2018-07-18 05:06:56 +01:00
Simon Pasquier f32acc0b7b discovery/openstack: remove unneeded assignment
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-07-15 12:37:57 +01:00
Julius Volz 05d6d6a2e5
k8s SD: Fix "schema" -> "scheme" typo (#4371)
Signed-off-by: Julius Volz <julius.volz@gmail.com>
2018-07-12 16:12:32 +02:00
Krasi Georgiev a155b6d29d fix the zookeeper race (#4355)
Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>
2018-07-06 08:39:38 +01:00
Dmitry Bashkatov 72327d98fb discovery/kubernetes/ingress: remove unnecessary check
Signed-off-by: Dmitry Bashkatov <dbashkatov@gmail.com>
2018-07-04 15:47:11 +03:00
Dmitry Bashkatov e2baf89eac discovery/kubernetes/ingress: fix scheme discovery (Closes #4327)
Signed-off-by: Dmitry Bashkatov <dbashkatov@gmail.com>
2018-07-04 13:28:44 +03:00
Dmitry Bashkatov 9cdca50bdd discovery/kubernetes/ingress: add more tests
Signed-off-by: Dmitry Bashkatov <dbashkatov@gmail.com>
2018-07-04 13:28:44 +03:00
Julius Volz 5cf0113762
Add "omitempty" to some SD config YAML field tags (#4338)
Especially for Kubernetes SD, this fixes a bug where the rendered
configuration says "api_server: null", which when read back is not
interpreted as an un-set API server (thus the default is not applied).

Signed-off-by: Julius Volz <julius.volz@gmail.com>
2018-07-03 13:43:41 +02:00
Simon Pasquier 6eab4bbca1 kubernetes_sd: fix namespace filtering (#4273)
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-06-15 09:08:14 +01:00
Paul Gier d24d2acd11 config: set target group source index during unmarshalling (#4245)
* config: set target group source index during unmarshalling

Fixes issue #4214 where the scrape pool is unnecessarily reloaded for a
config reload where the config hasn't changed.  Previously, the discovery
manager changed the static config after loading which caused the in-memory
config to differ from a freshly reloaded config.

Signed-off-by: Paul Gier <pgier@redhat.com>

* [issue #4214] Test that static targets are not modified by discovery manager

Signed-off-by: Paul Gier <pgier@redhat.com>
2018-06-13 16:34:59 +01:00
Simon Pasquier 0e5e7f75cd discovery/file: fix logging (#4178)
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-06-12 12:45:59 +01:00
Callum Styan 03578d5df8 add example usage of SD adapter for converting unsupported SD type to filesd (#3720)
Signed-off-by: Callum Styan <callumstyan@gmail.com>
2018-05-30 13:14:34 +01:00
Adam Shannon a22e1736b9 discovery/marathon: include url in fetchApps error (#4171)
This was previously part of a larger PR, but that was closed.

https://github.com/prometheus/prometheus/issues/4048#issuecomment-389899997

This change could include auth information in the URL. That's been
fixed in upstream go, but not until Go 1.11. See: https://github.com/golang/go/issues/24572

Signed-off-by: Adam Shannon <adamkshannon@gmail.com>
2018-05-18 10:20:14 +01:00
Damien Lespiau e64037053d Expose controller kind and name to labelling rules
Relabelling rules can use this information to attach the name of the controller
that has created a pod.

In turn, this can be used to slice metrics by workload at query time, ie.
"Give me all metrics that have been created by the $name Deployment"

Signed-off-by: Damien Lespiau <damien@weave.works>
2018-05-09 11:51:37 +02:00
Nathan Graves 5b27996cb3 Include GCE labels during service discovery. Updated vendor files for Google API. (#4150)
Signed-off-by: Nathan Graves <nathan.graves@kofile.us>
2018-05-08 17:37:47 +01:00
beorn7 a4e4bec3fe Merge branch 'release-2.2' 2018-04-30 14:38:29 +02:00
Elif T. Kuş 57dcdfb15f Rewrote tests with testutil for several test files (#4086)
* promql: Rewrote tests with testutil for functions_test

Signed-off-by: Elif T. Kuş <elifkus@gmail.com>

* pkg/relabel: Rewrote tests with testutil for relabel_test

Signed-off-by: Elif T. Kuş <elifkus@gmail.com>

* discovery/consul: Rewrote tests with testutil for consul_test

Signed-off-by: Elif T. Kuş <elifkus@gmail.com>

* scrape: Rewrote tests with testutil for manager_test

Signed-off-by: Elif T. Kuş <elifkus@gmail.com>
2018-04-27 13:11:16 +01:00
Yecheng Fu 2be543e65a Simplify some code and comments.
Signed-off-by: Yecheng Fu <cofyc.jackson@gmail.com>
2018-04-25 19:29:34 +02:00
Yecheng Fu 46683dd67d Simplify code.
- Unified `send` function.
- Pass InformerSynced functions to `cache.WaitForCacheSync`.
- Use `Role\w+` constants instead of literal string.

Signed-off-by: Yecheng Fu <cofyc.jackson@gmail.com>
2018-04-25 19:29:21 +02:00
Yecheng Fu 3a253f796c Fix grammar in comments and add missing expectedMaxItems to let it
break fast.

Signed-off-by: Yecheng Fu <cofyc.jackson@gmail.com>
2018-04-25 19:29:03 +02:00
Yecheng Fu d73b0d3141 Move hasSynced interface and its implementations to *_test.go files.
Signed-off-by: Yecheng Fu <cofyc.jackson@gmail.com>
2018-04-25 19:28:49 +02:00
Yecheng Fu 8ceb8f2ae8 Refactor Kubernetes Discovery Part 2: Refactoring
- Doing the initial listing and syncing to the scrape manager before
  registering event handlers may lose events that happen during listing
  and syncing (if that takes a long time). We should register event
  handlers at the very beginning and, before processing, just wait until
  the informers have synced (the informer sync lists all objects and calls
  the OnUpdate event handler).
- Use a queue so that we don't block event callbacks and so that an object
  is processed only once if it is added multiple times before being processed.
- Fix bug in `serviceUpdate` in endpoints.go, we should build endpoints
  when `exists && err == nil`. Add `^TestEndpointsDiscoveryWithService`
  tests to test this feature.

Testing:

- Use `k8s.io/client-go` testing framework and fake implementations which are
  more robust and reliable for testing.
- `Test\w+DiscoveryBeforeRun` are used to test objects created before
  discoverer runs
- `Test\w+DiscoveryAdd\w+` are used to test adding objects
- `Test\w+DiscoveryDelete\w+` are used to test deleting objects
- `Test\w+DiscoveryUpdate\w+` are used to test updating objects
- `TestEndpointsDiscoveryWithService\w+` are used to test endpoints
  events triggered by services
- `cache.DeletedFinalStateUnknown` related code is removed, because
  we don't care about deleted objects in the store; we only need the
  object's name to send a special `targetgroup.Group` to the scrape manager

Signed-off-by: Yecheng Fu <cofyc.jackson@gmail.com>
2018-04-25 19:28:34 +02:00
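The queue behaviour described in the commit above (an object added several times before it is processed is handled only once) can be sketched as follows. This is a hypothetical, single-goroutine illustration; `dedupQueue` is an invented name and is not the `k8s.io/client-go` work queue.

```go
package main

import "fmt"

// dedupQueue is a hypothetical FIFO queue that ignores keys which are
// already waiting to be processed. It is not safe for concurrent use.
type dedupQueue struct {
	pending map[string]bool
	order   []string
}

func newDedupQueue() *dedupQueue {
	return &dedupQueue{pending: map[string]bool{}}
}

// add enqueues key unless it is already pending.
func (q *dedupQueue) add(key string) {
	if q.pending[key] {
		return
	}
	q.pending[key] = true
	q.order = append(q.order, key)
}

// get removes and returns the oldest pending key, if there is one.
func (q *dedupQueue) get() (string, bool) {
	if len(q.order) == 0 {
		return "", false
	}
	key := q.order[0]
	q.order = q.order[1:]
	delete(q.pending, key)
	return key, true
}

func main() {
	q := newDedupQueue()
	// Three events for the same pod arrive before anything is processed.
	q.add("default/pod-a")
	q.add("default/pod-a")
	q.add("default/pod-b")
	q.add("default/pod-a")
	for {
		key, ok := q.get()
		if !ok {
			break
		}
		fmt.Println("processing", key)
	}
}
```

Because the event callback only enqueues a key, it returns immediately, and the processing loop looks up the latest state of each object once.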
Adam Shannon 809881d7f5 support reading basic_auth password_file for HTTP basic auth (#4077)
Issue: https://github.com/prometheus/prometheus/issues/4076

Signed-off-by: Adam Shannon <adamkshannon@gmail.com>
2018-04-25 18:19:06 +01:00
Rohit Gupta 30c3e02864 Fixes #4090. Marathon service discovery for 5XX http response (#4091)
Signed-off-by: rohit01 <hello@rohit.io>
2018-04-17 09:28:06 +01:00
sev3ryn cc917aee7f fix of endless loop while doing Consul service discovery. (#4044)
Reloading Prometheus configs doesn't make loop end.
It produced a goroutine leak
2018-04-05 10:41:09 +01:00
Philippe Laflamme 2aba238f31 Use common HTTPClientConfig for marathon_sd configuration (#4009)
This adds support for basic authentication which closes #3090

The support for specifying the client timeout was removed as discussed in https://github.com/prometheus/common/pull/123. Marathon was the only sd mechanism doing this and configuring the timeout is done through `Context`.

DC/OS uses a custom `Authorization` header for authenticating. This adds 2 new configuration properties to reflect this.

Existing configuration files that use the bearer token will no longer work. More work is required to make this backwards compatible.
2018-04-05 09:08:18 +01:00
Manos Fokas 25f929b772 Yaml UnmarshalStrict implementation. (#4033)
* Updated yaml vendor package.

* remove checkOverflow duplicate in rulefmt

* remove duplicated HTTPClientConfig.Validate()

* Added yaml static check.
2018-04-04 09:07:39 +01:00
albatross0 0245fd55bf Add a machine type label to GCE SD (#4032) 2018-03-31 09:20:19 +01:00
Kristiyan Nikolov be85ba3842 discovery/ec2: Support filtering instances in discovery (#4011) 2018-03-31 07:51:11 +01:00
Corentin Chary 60dafd425c consul: improve consul service discovery (#3814)
* consul: improve consul service discovery

Related to #3711

- Add the ability to filter by tag and node-meta in an efficient way (`/catalog/services`
  allow filtering by node-meta, and returns a `map[string]string` or `service`->`tags`).
  Tags and node-meta are also used in `/catalog/service` requests.
- Do not require a call to the catalog if services are specified by name. This is important
  because on large cluster `/catalog/services` changes all the time.
- Add `allow_stale` configuration option to do stale reads. Non-stale
  reads can be costly, even more when you are doing them to a remote
  datacenter with 10k+ targets over WAN (which is common for federation).
- Add `refresh_interval` to minimize the strain on the catalog and on the
  service endpoint. This is needed because of that kind of behavior from
  consul: https://github.com/hashicorp/consul/issues/3712 and because a catalog
  on a large cluster would basically change *all* the time. No need to discover
  targets in 1sec if we scrape them every minute.
- Added plenty of unit tests.

Benchmarks
----------

```yaml
scrape_configs:

- job_name: prometheus
  scrape_interval: 60s
  static_configs:
    - targets: ["127.0.0.1:9090"]

- job_name: "observability-by-tag"
  scrape_interval: "60s"
  metrics_path: "/metrics"
  consul_sd_configs:
    - server: consul.service.par.consul.prod.crto.in:8500
      tag: marathon-user-observability  # Used in After
      refresh_interval: 30s             # Used in After+delay
  relabel_configs:
    - source_labels: [__meta_consul_tags]
      regex: ^(.*,)?marathon-user-observability(,.*)?$
      action: keep

- job_name: "observability-by-name"
  scrape_interval: "60s"
  metrics_path: "/metrics"
  consul_sd_configs:
    - server: consul.service.par.consul.prod.crto.in:8500
      services:
        - observability-cerebro
        - observability-portal-web

- job_name: "fake-fake-fake"
  scrape_interval: "15s"
  metrics_path: "/metrics"
  consul_sd_configs:
    - server: consul.service.par.consul.prod.crto.in:8500
      services:
        - fake-fake-fake
```

Note: tested with ~1200 services, ~5000 nodes.

| Resource | Empty | Before | After | After + delay |
| -------- |:-----:|:------:|:-----:|:-------------:|
|/service-discovery size|5K|85MiB|27k|27k|
|`go_memstats_heap_objects`|100k|1M|120k|110k|
|`go_memstats_heap_alloc_bytes`|24MB|150MB|28MB|27MB|
|`rate(go_memstats_alloc_bytes_total[5m])`|0.2MB/s|28MB/s|2MB/s|0.3MB/s|
|`rate(process_cpu_seconds_total[5m])`|0.1%|15%|2%|0.01%|
|`process_open_fds`|16|*1236*|22|22|
|`rate(prometheus_sd_consul_rpc_duration_seconds_count{call="services"}[5m])`|~0|1|1|*0.03*|
|`rate(prometheus_sd_consul_rpc_duration_seconds_count{call="service"}[5m])`|0.1|*80*|0.5|0.5|
|`prometheus_target_sync_length_seconds{quantile="0.9",scrape_job="observability-by-tag"}`|N/A|200ms|0.2ms|0.2ms|
|Network bandwidth|~10kbps|~2.8Mbps|~1.6Mbps|~10kbps|

Filtering by tag using relabel_configs uses **100kiB and 23kiB/s per service per job** and quite a lot of CPU. It also sends an additional *1Mbps* of traffic to consul.
Being a little bit smarter about this reduces the overhead quite a lot.
Limiting the number of `/catalog/services` queries per second almost removes the overhead of service discovery.

* consul: tweak `refresh_interval` behavior

`refresh_interval` now does what is advertised in the documentation:
there won't be more than one update per `refresh_interval`. It now
defaults to 30s (which was also the current waitTime in the consul query).

This also makes sure we don't wait another 30s if we already waited 29s
in the blocking call, by subtracting the number of elapsed seconds.

Hopefully this will do what people expect it does and will be safer
for existing consul infrastructures.
2018-03-23 14:48:43 +00:00
Ben Kochie 0d9fe18f5e Fix nil context staticcheck error. 2018-03-22 07:59:39 +00:00
Aaron Kirkbride c47fbcb626 Fix moved fsnotify dependency (#3995) 2018-03-21 15:46:31 +00:00
Jeeyoung Kim 5b962c5748 Revert "Feature: Allow getting credentials via EC2 role (#3343)" (#3985)
This reverts commit 808f79f00a.
2018-03-20 12:34:54 +00:00
Matt Palmer 042090a6d3 [dns_sd] Send an EDNS0 query by default (#3586)
Based on https://groups.google.com/d/topic/prometheus-users/02kezHbuea4/discussion

Does not attempt to handle a situation where the server does not understand
EDNS0, however that is an unlikely case, and the behaviour of such ancient
systems is hard to predict in advance, so if it does come up, it will need
to be handled on a case-by-case basis.
2018-03-09 10:21:58 +00:00
Yecheng Fu 56ed29fbf7 Map target infos of endpoints to prometheus meta labels. (#3770) 2018-03-09 10:07:00 +00:00
Marek Siarkowicz 86011047ca Validate required fields in sd configuration (#3911) 2018-03-05 19:27:54 +00:00
Krasi Georgiev 6b0e9ef183 Validate json parse for TargetGroup Unmarshal (#3614)
Using DisallowUnknownFields in golang 1.10 to forbid unknown fields in targetGroups
added the license header for the targetGroup test
2018-02-27 12:33:27 +00:00
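A minimal sketch of the strict parsing mentioned above, using `Decoder.DisallowUnknownFields` from Go's `encoding/json` (available since Go 1.10). The `group` struct and `parseGroup` helper are simplified, hypothetical stand-ins for the real target group type.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// group is a hypothetical, simplified target group.
type group struct {
	Targets []string          `json:"targets"`
	Labels  map[string]string `json:"labels"`
}

// parseGroup decodes s and rejects any field that group does not declare.
func parseGroup(s string) (group, error) {
	var g group
	dec := json.NewDecoder(strings.NewReader(s))
	dec.DisallowUnknownFields()
	err := dec.Decode(&g)
	return g, err
}

func main() {
	// "label" is a typo for "labels"; strict decoding reports it as an error
	// instead of silently dropping the labels.
	_, err := parseGroup(`{"targets": ["localhost:9090"], "label": {"env": "dev"}}`)
	fmt.Println(err)
}
```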
Krasi Georgiev 4fa7e719f4 race in Triton SD Test (#3885) 2018-02-26 10:03:50 +00:00
ferhat elmas ffa673f7d8 General simplifications (#3887)
Another try as in #1516
2018-02-26 07:58:10 +00:00
Pedro Araújo 575f665944 Add OS type meta label to Azure SD (#3863)
There is currently no way to differentiate Windows instances from Linux
ones. This is needed when you have a mix of node_exporters /
wmi_exporters for OS-level metrics and you want to have them in separate
scrape jobs.

This change allows you to do just that. Example:

```
  - job_name: 'node'
    azure_sd_configs:
      - <azure_sd_config>
    relabel_configs:
      - source_labels: [__meta_azure_machine_os_type]
        regex: Linux
        action: keep
```

The way the vendor'd AzureSDK provides to get the OsType is a bit
awkward - as far as I can tell, this information can only be gotten from
the startup disk. Newer versions of the SDK appear to improve this a
bit (by having OS information in the InstanceView), but the current way
still works.
2018-02-19 15:40:57 +00:00
Simon Pasquier 2072bbc824 Send update when pod's IP address is empty
When the pod gets evicted, its IP address becomes empty and it needs to
be removed from the targets.
2018-02-14 14:23:52 +01:00
Krasi Georgiev b75428ec19 rename package retrieve to scrape
no functional changes, just renaming retrieval to scrape
2018-02-01 09:55:07 +00:00
Frederic Branczyk d3ae1ac40e
Merge pull request #3741 from krasi-georgiev/discovery-race
read/write race for the  context field in the discovery package
2018-01-30 18:17:09 +01:00
pasquier-s bde64cf5a6 Fix Kubernetes endpoints SD for empty subsets (#3660)
* Fix Kubernetes endpoints SD for empty subsets

When an endpoints object has no associated pods (replica scaled to zero
for instance), the endpoints SD should return a target group with no
targets so that the SD manager propagates this information to the scrape
manager.

Fixes #3659

* Don't send nil target groups from the Kubernetes SD

This is to be consistent with the endpoints SD part.
2018-01-30 15:00:33 +00:00
Krasi Georgiev 818dda72db updated the sd tests 2018-01-29 15:19:15 +00:00
Krasi Georgiev acc4197098 remove discovery race for the context field 2018-01-29 15:18:07 +00:00
Frederic Branczyk 47538cf6ce
Merge pull request #3747 from prometheus/sched-update-throttle
Update throttle & tsdb update
2018-01-29 16:05:05 +01:00
Frederic Branczyk 73e829137b
discovery: Cleanup ticker 2018-01-29 13:51:04 +01:00
Ganesh Vernekar 66b0aa3b45 Fixed race condition in map iteration and map write in Discovery (#3735) (#3738)
* Fixed concurrent map iteration and map write in Discovery (#3735)

* discovery: Changed Lock to RLock in Collect
2018-01-28 22:24:31 +05:30
Krasi Georgiev fe926e7829 update the discover tests
the discovery test is now only testing update and get groups.
It doesn't do an e2e test but just a unit test of setting and receiving
target groups
2018-01-27 12:03:06 +00:00
Callum Styan 7dc05538f7 docs: SD implementations do not have to only send new/changed target groups (#3713) 2018-01-26 22:03:11 +00:00
Frederic Branczyk cfa0253ed8
discovery: Schedule updates to throttle 2018-01-26 16:24:44 +01:00
zemek 8a01a0fbed Set consul server default to localhost:8500 (#3703) 2018-01-24 12:14:32 +00:00
Julius Volz 09e460a647
discovery: Rename file SD mtime metric (#3723)
- "timestamp" -> "mtime" to be in line with node exporter and clearer.
- add unit suffix
2018-01-22 14:02:24 +01:00
Krasi Georgiev ec26751fd2 use mutexes for the discovery manager instead of a loop as this was a stupid idea 2018-01-17 18:12:58 +00:00
Krasi Georgiev 767faa44b6 fixed the tests
Signed-off-by: Krasi Georgiev <krasi.root@gmail.com>
2018-01-15 13:39:47 +00:00
Krasi Georgiev d12e6f29fc discovery manager ApplyConfig now takes a direct ServiceDiscoveryConfig so that it can be used for the notify manager
reimplement the service discovery for the notify manager

Signed-off-by: Krasi Georgiev <krasi.root@gmail.com>
2018-01-15 13:39:44 +00:00
Goutham Veeramachaneni b20a1b1b1b
Merge pull request #3654 from krasi-georgiev/discovery-handle-discoverer-updates
discovery - handle Discoverers that send only target Group updates.
2018-01-15 18:53:22 +05:30
Krasi Georgiev 790cf30fcb remove unneeded check 2018-01-15 11:52:20 +00:00
Krasi Georgiev 38938ba493 comment nits 2018-01-15 11:47:36 +00:00
Krasi Georgiev febebcd49a more comments for the future ME, and reverted the Discovery manager execution changes as these were correct in the first place 2018-01-12 22:07:21 +00:00
Krasi Georgiev 78ba5e62a6 a few more useful comments 2018-01-12 13:58:23 +00:00
Krasi Georgiev cabce21b70 delete empty targets sets to avoid memory leaks 2018-01-12 13:10:59 +00:00
Krasi Georgiev abfd9f1920 nits 2018-01-12 12:19:52 +00:00
Shubheksha Jalan 0471e64ad1 Use shared types from the common repo (#3674)
* refactor: use shared types from common repo, remove util/config

* vendor: add common/config

* fix nit
2018-01-11 16:10:25 +01:00
Krasi Georgiev 546c29af5b return early for nil target groups 2018-01-09 16:34:23 +00:00
Callum Styan 97464236c7 comments with TargetProvider should read Discoverer instead (#3667) 2018-01-08 23:59:18 +00:00
Krasi Georgiev 77bf6bece0 discovery-manager comment update 2018-01-04 21:57:28 +00:00
Krasi Georgiev 135ea0f793 discovery manager - doesn't need sorting of the target groups so move it in the discovery manager tests as we only need it there.
discovery manager - refactor the discovery tests.

Signed-off-by: Krasi Georgiev <krasi.root@gmail.com>
2018-01-04 21:41:54 +00:00
Krasi Georgiev 638818a974 some Discoverers send nil targetgroup so need to check for it when updating a group 2018-01-04 13:57:34 +00:00
Krasi Georgiev 7e28397a2c discovery - handle Discoverers that send only target Group updates rather than all Targets on every update.
Signed-off-by: Krasi Georgiev <krasi.root@gmail.com>
2018-01-04 13:28:37 +00:00
Shubheksha Jalan ec94df49d4 Refactor SD configuration to remove config dependency (#3629)
* refactor: move targetGroup struct and CheckOverflow() to their own package

* refactor: move auth and security related structs to a utility package, fix import error in utility package

* refactor: Azure SD, remove SD struct from config

* refactor: DNS SD, remove SD struct from config into dns package

* refactor: ec2 SD, move SD struct from config into the ec2 package

* refactor: file SD, move SD struct from config to file discovery package

* refactor: gce, move SD struct from config to gce discovery package

* refactor: move HTTPClientConfig and URL into util/config, fix import error in httputil

* refactor: consul, move SD struct from config into consul discovery package

* refactor: marathon, move SD struct from config into marathon discovery package

* refactor: triton, move SD struct from config to triton discovery package, fix test

* refactor: zookeeper, move SD structs from config to zookeeper discovery package

* refactor: openstack, remove SD struct from config, move into openstack discovery package

* refactor: kubernetes, move SD struct from config into kubernetes discovery package

* refactor: notifier, use targetgroup package instead of config

* refactor: tests for file, marathon, triton SD - use targetgroup package instead of config.TargetGroup

* refactor: retrieval, use targetgroup package instead of config.TargetGroup

* refactor: storage, use config util package

* refactor: discovery manager, use targetgroup package instead of config.TargetGroup

* refactor: use HTTPClient and TLS config from configUtil instead of config

* refactor: tests, use targetgroup package instead of config.TargetGroup

* refactor: fix targetgroup.Group pointers that were removed by mistake

* refactor: openstack, kubernetes: drop prefixes

* refactor: remove import aliases forced due to vscode bug

* refactor: move main SD struct out of config into discovery/config

* refactor: rename configUtil to config_util

* refactor: rename yamlUtil to yaml_config

* refactor: kubernetes, remove prefixes

* refactor: move the TargetGroup package to discovery/

* refactor: fix order of imports
2017-12-29 21:01:34 +01:00
Callum Styan d76d5de66f refactor to make timestamp collector work for multiple file_sd's 2017-12-23 10:13:11 +00:00
KalivarapuReshma a00fc883c3 Add metric for timestamp of the files file_sd is using. 2017-12-23 10:13:11 +00:00
pasquier-s 78625f85a7 Fix race condition on file SD (#3468)
The file discovery should only stop the watcher if it has been created
otherwise it may trigger a segmentation fault.
2017-12-21 10:07:43 +00:00
Krasi Georgiev 587dec9eb9 rebased and resolved conflicts with the new Discovery GUI page
Signed-off-by: Krasi Georgiev <krasi.root@gmail.com>
2017-12-18 20:10:03 +00:00
Krasi Georgiev 80182a5d82 use poolKey as the pool map key to avoid multi dimensional maps 2017-12-18 17:23:47 +00:00
Krasi Georgiev 1ec76d1950 rearrange the context variables and logic
split the groupsMerge function to set and get
other small nits
2017-12-18 17:23:47 +00:00
Krasi Georgiev f2df712166 updated README 2017-12-18 17:22:50 +00:00
Krasi Georgiev aca8f85699 fixed the tests 2017-12-18 17:22:50 +00:00
Krasi Georgiev fe6c544532 some renaming and comments fixes.
remove some select state that is most likely obsolete and hopefully doesn't break anything :)
merge targets will sort by Discoverer name so we can have consistent tests for the maps.
2017-12-18 17:22:50 +00:00
Krasi Georgiev f5c2c5ff8f break up the start provider func so that unit tests can run against it. 2017-12-18 17:22:50 +00:00
Krasi Georgiev c5cb0d2910 simplify naming and API. 2017-12-18 17:22:50 +00:00
Krasi Georgiev 9c61f0e8a0 scrape pool doesn't rely on context as Stop() needs to be blocking to prevent Scrape loops trying to write to a closed TSDB storage. 2017-12-18 17:22:49 +00:00
Krasi Georgiev e405e2f1ea refactored discovery 2017-12-18 17:22:49 +00:00
Brian Brazil 81db4716c1
Mention SD moratorium in README (#3573) 2017-12-11 15:38:23 +00:00
Will Howard 6a80fc24cf Parse the normalized container.PortMappings presented by the Marathon 1.5.x API
Fixes #3465
2017-12-06 11:23:12 -05:00
Brian Brazil d7b3df5ae1 Fix staticcheck errors 2017-12-02 14:52:13 +00:00
Krasi Georgiev 29506e0bca one meaningless write to the config file to trigger another fsnotify (#3492) 2017-12-01 17:32:27 +00:00
Tom Wilkie 099c50ce93 Avoid empty pod UID in test. 2017-11-24 15:02:42 +00:00
Tom Wilkie 9811e90d65 Fix tests. 2017-11-24 12:24:13 +00:00
Tom Wilkie 06dc1e8797 Include Pod UID in the discovery metadata. 2017-11-20 21:09:47 +00:00
Tobias Schmidt 91be55ebf0
Merge pull request #3458 from grandbora/test-race
Fix race in test
2017-11-13 17:57:21 +01:00
Bora Tunca 493fd6bd1f Fix race in test 2017-11-13 11:47:59 -05:00
Krasi Georgiev 1005ef0a70 Fix flaky file discovery tests - sync the channel draining goroutine 2017-11-13 12:12:01 +00:00
Bora Tunca 3cc01a3088 Add more discovery tests for updating target groups (#3426)
* Adds a test covering the case where a target provider sends updated versions of the same target groups and the system should reconcile to the latest version of each of the target groups
* Refactors how input data is represented in the tests. It used to be literal declarations of necessary structs, now it is parsing yaml. Yaml declarations are half as long as the former. And these can be put in a fixture file
* Adds a tiny bit of refactoring on test timeouts
2017-11-12 03:39:08 +01:00