We were missing testing on the behaviour of the configuration
unmarshalling.
This PR adds a refresh command that can be used to test that we
use the correct refresh function.
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
Since we dependend on go1.14 now, we can use T.Cleanup
https://golang.org/pkg/testing/#T.Cleanup
This provides a nicer approach to shut down the test server.
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
* OpenStack SD: Add availability config option, to choose endpoint type
In some environments Prometheus must query OpenStack via an alternative
endpoint type (gophercloud calls this `availability`.
This commit implements this option.
Co-Authored-By: Dennis Kuhn <d.kuhn@syseleven.de>
Signed-off-by: Steffen Neubauer <s.neubauer@syseleven.de>
use ?wait=10m will give results as fast as usual when data is changing
but will perform far less requests when services do not change.
On large infrastructure, this will reduce quite a lot the number of
qps on Consul servers while having the same performance for freshness
of results.
Signed-off-by: Pierre Souchay <p.souchay@criteo.com>
Previously `max` results stopped reading from results in tests
prematurely, as it stopped when `max` number of items were received from
the channel instead of `max` number of unique target groups received.
This caused flaky tests where the same target group was received
multiple times, as Kubernetes informers may emit the same event multiple
times.
Before this patch, running this test repeatedly failed eventually. After
this patch I have run the test many thousand times without failure.
```bash
go test -run TestEndpointsDiscoveryNamespaces -count 1000 -test.v
```
Signed-off-by: Frederic Branczyk <fbranczyk@gmail.com>
Added optional configuration item role, defaults to 'container' (backwards-compatible).
Setting role to 'cn' will discover compute nodes instead.
Human-friendly compute node hostname discovery depends on cmon 1.7.0:
c1a2aeca36
Adjust testcases to use discovery config per case as two different types are now supported.
Updated documentation:
* new role setting
* clarify what the name 'container' covers as triton uses different names in different locations
Signed-off-by: jzinkweg <jzinkweg@gmail.com>
Add extra meta labels which will be useful in the case
Prometheus discovery hypervisor .
Signed-off-by: pzqu <pzqu@qq.com>
Co-authored-by: pzqu <pzqu@example.com>
We can assume that not all target groups are nil in normal scernarios,
so we can create targets[poolKey] outside the loop.
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
The Kubernetes client records workqueue duration and latency metrics as
seconds so there's no need to convert the values from microseconds to
seconds anymore.
The cache metrics (prometheus_sd_kubernetes_cache_*) are removed because
they aren't used anymore by the client though still exposed by its API.
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* adding additional unit tests for getDataCenter() in consul
Signed-off-by: Jean-Baptiste Le Duigou <jb.leduigou@gmail.com>
* Consult Tests : update comments to start with uppercase and end with point
Signed-off-by: Jean-Baptiste Le Duigou <jb.leduigou@gmail.com>
* Consult Test : using table-driven tests
Signed-off-by: Jean-Baptiste Le Duigou <jb.leduigou@gmail.com>
* Consul Test : cleaner syntax
Signed-off-by: Jean-Baptiste Le Duigou <jb.leduigou@gmail.com>
* Consul Test : even cleaner syntax
Signed-off-by: Jean-Baptiste Le Duigou <jb.leduigou@gmail.com>
* Consul Test : update comments
Signed-off-by: Jean-Baptiste Le Duigou <jb.leduigou@gmail.com>
* Fixing naming convention by removing underscore in function name
Signed-off-by: Jean-Baptiste Le Duigou <jb.leduigou@gmail.com>
* Removing duplicated test case for getDatacenter()
Signed-off-by: Jean-Baptiste Le Duigou <jb.leduigou@gmail.com>
* adding unit test for target group
Signed-off-by: Jean-Baptiste Le Duigou <jb.leduigou@gmail.com>
* Improve unit tests for target group
Signed-off-by: Jean-Baptiste Le Duigou <jb.leduigou@gmail.com>
* Fix imports
Signed-off-by: Jean-Baptiste Le Duigou <jb.leduigou@gmail.com>
* Improve test by asserting on whole Target Group object
Signed-off-by: Jean-Baptiste Le Duigou <jb.leduigou@gmail.com>
- Use testutil.ToFloat64 to collect testing metrics
- Declare ServiceDiscoveryConfig directly instead of calling Unmarshal on a piece of YAML
Signed-off-by: Nevill <nevill.dutt@gmail.com>
* Update go.mod dependencies before release
Signed-off-by: Julius Volz <julius.volz@gmail.com>
* Add issue for showing query warnings in promtool
Signed-off-by: Julius Volz <julius.volz@gmail.com>
* Revert json-iterator back to 1.1.6
It produced errors when marshaling Point values with special float
values.
Signed-off-by: Julius Volz <julius.volz@gmail.com>
* Fix expected step values in promtool tests after client_golang update
Signed-off-by: Julius Volz <julius.volz@gmail.com>
* Update generated protobuf code after proto dep updates
Signed-off-by: Julius Volz <julius.volz@gmail.com>
With the next release of client_golang, Summaries will not have
objectives by default. To not lose the objectives we have right now,
explicitly state the current default objectives.
Signed-off-by: beorn7 <beorn@grafana.com>
From the documentation:
> The default HTTP client's Transport may not
> reuse HTTP/1.x "keep-alive" TCP connections if the Body is
> not read to completion and closed.
This effectively enable keep-alive for the fixed requests.
Signed-off-by: Romain Baugue <romain.baugue@elwinar.com>
Add extra meta labels which will be useful in the case
Prometheus discovery instances from all projects.
Signed-off-by: Kien Nguyen <kiennt2609@gmail.com>
i) Uses the more idiomatic Wrap and Wrapf methods for creating nested errors.
ii) Fixes some incorrect usages of fmt.Errorf where the error messages don't have any formatting directives.
iii) Does away with the use of fmt package for errors in favour of pkg/errors
Signed-off-by: tariqibrahim <tariq181290@gmail.com>
* discovery: factorize for SD based on refresh
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* discovery: use common metrics for refresh
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
See,
$ codespell -S './vendor/*,./.git*,./web/ui/static/vendor*' --ignore-words-list="uint,dur,ue,iff,te,wan"
Signed-off-by: Mario Trangoni <mjtrangoni@gmail.com>
i) Increased the size of the Service Discovery Readme title
ii) Changed `TargetGroups` to "target groups" as it has been relocated and renamed to another package.
Signed-off-by: tariqibrahim <tariq181290@gmail.com>
Although it is spelling mistakes, it might make an affects
while reading.
Co-Authored-By: Kim Bao Long longkb@vn.fujitsu.com
Signed-off-by: Nguyen Hai Truong <truongnh@vn.fujitsu.com>
* discovery/kubernetes: fix support for password_file
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Create and pass custom RoundTripper to Kubernetes client
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Use inline HTTPClientConfig
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
https://github.com/miekg/dns/pull/815 goes into the detail, but more or
less the existing solution was no longer supported and needed to be
rewritten to support the new versions of the library. miekg additionally
claims this is more correct in the ticket.
Signed-off-by: Erik Hollensbe <github@hollensbe.org>
* *: use latest release of staticcheck
It also fixes a couple of things in the code flagged by the additional
checks.
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Use official release of staticcheck
Also run 'go list' before staticcheck to avoid failures when downloading packages.
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* vendor update
* discovery/gce: oauth2.NoContext is deprecated, replace with context.Background()
Signed-off-by: Erik Hollensbe <github@hollensbe.org>
* discovery: send empty group on blank SD config
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Update comments
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Add another comment
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* add logic to check if an azure VM is deallocated or not
* update documentation with the new azure power state label
Signed-off-by: tariqibrahim <tariq.ibrahim@microsoft.com>
* Adding private_dns_name to the list of ec2 labels which can be used in node naming for dynamic environments
Signed-off-by: Serghei Anicheev <serghei@rentalcover.com>
* discovery/azure: fail hard when client_id/client_secret is empty
Signed-off-by: mengnan <supernan1994@gmail.com>
* discovery/azure: fail hard when authentication parameters are missing
Signed-off-by: mengnan <supernan1994@gmail.com>
* add unit test
Signed-off-by: mengnan <supernan1994@gmail.com>
* add unit test
Signed-off-by: mengnan <supernan1994@gmail.com>
* format code
Signed-off-by: mengnan <supernan1994@gmail.com>
Fixes#4855 - ServicePort was wrongly used to construct an address to endpoints
defined in portMappings. This was changed to HostPort. Support for obtaining
auto-generated host ports was also added.
Signed-off-by: Timo Beckers <timo@incline.eu>
Currently Prometheus requests show up with a UA of Go-http-client/1.1
which isn't super helpful. Though the X-Prometheus-Remote-* headers
exist they need to be explicitly configured when logging the request in
order to be able to deduce this is a request originating from
Prometheus. By setting the header we remove this ambiguity and make
default server logs just a bit more useful.
This also updates a few other places to consistently capitalize the 'P'
in the user agent, as well as ensure we set a UA to begin with.
Signed-off-by: Daniele Sluijters <daenney@users.noreply.github.com>
Set __meta_ec2_platform label with the instance platform string. Set to 'windows' on Windows servers and absent otherwise.
Signed-off-by: Silvio Gissi <silvio@gissilabs.com>
By default, OpenStack SD only queries for instances
from specified project. To discover instances from other
projects, users have to add more openstack_sd_configs for
each project.
This patch adds `all_tenants` <bool> options to
openstack_sd_configs. For example:
- job_name: 'openstack_all_instances'
openstack_sd_configs:
- role: instance
region: RegionOne
identity_endpoint: http://<identity_server>/identity/v3
username: <username>
password: <super_secret_password>
domain_name: Default
all_tenants: true
Co-authored-by: Kien Nguyen <kiennt2609@gmail.com>
Signed-off-by: dmatosl <danielmatos.lima@gmail.com>
* *: move to go 1.11
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Reduce number of places where we specify the Go version
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
Additionally, add triton groups metadata to the discovery reponse
and correct a documentation error regarding the triton server id
metadata.
Signed-off-by: Richard Kiene <richard.kiene@joyent.com>
Commit 1c89984 introduced the ability to expose the owner of the instance.
However, this breaks Prometheus if there is no OwnerID in the reservation (Eg. if you are using a private EC2-API introduced by #4333)
Signed-off-by: Jannick Fahlbusch <git@jf-projects.de>
* marathon-sd - change port gathering strategy, add support for container networking
- removed unnecessary error check on HTTPClientConfig.Validate()
- renamed PortDefinitions and PortMappings to PortDefinition and PortMapping respectively
- extended data model for extra parsed fields from Marathon json
- support container networking on Marathon 1.5+ (target Task.IPAddresses.x.Address)
- expanded test suite to cover all new cases
- test: cancel context when reading from doneCh before returning from function
- test: split test suite into Ports/PortMappings/PortDefinitions
Signed-off-by: Timo Beckers <timo@incline.eu>
* Change discovery subpackages to not use testify in tests
Signed-off-by: Camille Janicki <camille.janicki@gmail.com>
* Remove testify suite from vendor dir
Signed-off-by: Camille Janicki <camille.janicki@gmail.com>
Removing a final dot changes the meaning of the name and can cause
extra DNS lookups as the resolver traverses its search path.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* tidy up the discovery logs,updating loops and selects
few objects renamings
removed a very noise debug log on the k8s discovery. It would be usefull
to show some summary rather than every update as this is impossible to
follow.
added most comments as debug logs so each block becomes self
explanatory.
when the discovery receiving channel is full will retry again on the
next cycle.
Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>
* add noop logger for the SD manager tests.
Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>
* spelling nits
Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>
* discovery: coalesce identical SD configurations
Instead of creating as many SD providers as declared in the
configuration, the discovery manager merges identical configurations
into the same provider and keeps track of the subscribers. When
the manager receives target updates from a SD provider, it will
broadcast the updates to all interested subscribers.
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* simplfied SD updates throtling
Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>
* add default to catch cases when we don't have new updates.
Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>
* Inital support for Azure VMSS
Signed-off-by: Johannes Scheuermann <johannes.scheuermann@inovex.de>
* Add documentation for the newly introduced label
Signed-off-by: Johannes M. Scheuermann <joh.scheuer@gmail.com>
Allowing to set a custom endpoint makes it easy to monitor targets on non AWS providers with EC2 compliant APIs.
Signed-off-by: Jannick Fahlbusch <git@jf-projects.de>
Especially for Kubernetes SD, this fixes a bug where the rendered
configuration says "api_server: null", which when read back is not
interpreted as an un-set API server (thus the default is not applied).
Signed-off-by: Julius Volz <julius.volz@gmail.com>
* config: set target group source index during unmarshalling
Fixes issue #4214 where the scrape pool is unnecessarily reloaded for a
config reload where the config hasn't changed. Previously, the discovery
manager changed the static config after loading which caused the in-memory
config to differ from a freshly reloaded config.
Signed-off-by: Paul Gier <pgier@redhat.com>
* [issue #4214] Test that static targets are not modified by discovery manager
Signed-off-by: Paul Gier <pgier@redhat.com>
Relabelling rules can use this information to attach the name of the controller
that has created a pod.
In turn, this can be used to slice metrics by workload at query time, ie.
"Give me all metrics that have been created by the $name Deployment"
Signed-off-by: Damien Lespiau <damien@weave.works>
* promql: Rewrote tests with testutil for functions_test
Signed-off-by: Elif T. Kuş <elifkus@gmail.com>
* pkg/relabel: Rewrote tests with testutil for relabel_test
Signed-off-by: Elif T. Kuş <elifkus@gmail.com>
* discovery/consul: Rewrote tests with testutil for consul_test
Signed-off-by: Elif T. Kuş <elifkus@gmail.com>
* scrape: Rewrote tests with testutil for manager_test
Signed-off-by: Elif T. Kuş <elifkus@gmail.com>
- Do initial listing and syncing to scrape manager, then register event
handlers may lost events happening in listing and syncing (if it
lasted a long time). We should register event handlers at the very
begining, before processing just wait until informers synced (sync in
informer will list all objects and call OnUpdate event handler).
- Use a queue then we don't block event callbacks and an object will be
processed only once if added multiple times before it being processed.
- Fix bug in `serviceUpdate` in endpoints.go, we should build endpoints
when `exists && err == nil`. Add `^TestEndpointsDiscoveryWithService`
tests to test this feature.
Testing:
- Use `k8s.io/client-go` testing framework and fake implementations which are
more robust and reliable for testing.
- `Test\w+DiscoveryBeforeRun` are used to test objects created before
discoverer runs
- `Test\w+DiscoveryAdd\w+` are used to test adding objects
- `Test\w+DiscoveryDelete\w+` are used to test deleting objects
- `Test\w+DiscoveryUpdate\w+` are used to test updating objects
- `TestEndpointsDiscoveryWithService\w+` are used to test endpoints
events triggered by services
- `cache.DeletedFinalStateUnknown` related stuffs are removed, because
we don't care deleted objects in store, we only need its name to send
a specical `targetgroup.Group` to scrape manager
Signed-off-by: Yecheng Fu <cofyc.jackson@gmail.com>
This adds support for basic authentication which closes#3090
The support for specifying the client timeout was removed as discussed in https://github.com/prometheus/common/pull/123. Marathon was the only sd mechanism doing this and configuring the timeout is done through `Context`.
DC/OS uses a custom `Authorization` header for authenticating. This adds 2 new configuration properties to reflect this.
Existing configuration files that use the bearer token will no longer work. More work is required to make this backwards compatible.
* consul: improve consul service discovery
Related to #3711
- Add the ability to filter by tag and node-meta in an efficient way (`/catalog/services`
allow filtering by node-meta, and returns a `map[string]string` or `service`->`tags`).
Tags and nore-meta are also used in `/catalog/service` requests.
- Do not require a call to the catalog if services are specified by name. This is important
because on large cluster `/catalog/services` changes all the time.
- Add `allow_stale` configuration option to do stale reads. Non-stale
reads can be costly, even more when you are doing them to a remote
datacenter with 10k+ targets over WAN (which is common for federation).
- Add `refresh_interval` to minimize the strain on the catalog and on the
service endpoint. This is needed because of that kind of behavior from
consul: https://github.com/hashicorp/consul/issues/3712 and because a catalog
on a large cluster would basically change *all* the time. No need to discover
targets in 1sec if we scrape them every minute.
- Added plenty of unit tests.
Benchmarks
----------
```yaml
scrape_configs:
- job_name: prometheus
scrape_interval: 60s
static_configs:
- targets: ["127.0.0.1:9090"]
- job_name: "observability-by-tag"
scrape_interval: "60s"
metrics_path: "/metrics"
consul_sd_configs:
- server: consul.service.par.consul.prod.crto.in:8500
tag: marathon-user-observability # Used in After
refresh_interval: 30s # Used in After+delay
relabel_configs:
- source_labels: [__meta_consul_tags]
regex: ^(.*,)?marathon-user-observability(,.*)?$
action: keep
- job_name: "observability-by-name"
scrape_interval: "60s"
metrics_path: "/metrics"
consul_sd_configs:
- server: consul.service.par.consul.prod.crto.in:8500
services:
- observability-cerebro
- observability-portal-web
- job_name: "fake-fake-fake"
scrape_interval: "15s"
metrics_path: "/metrics"
consul_sd_configs:
- server: consul.service.par.consul.prod.crto.in:8500
services:
- fake-fake-fake
```
Note: tested with ~1200 services, ~5000 nodes.
| Resource | Empty | Before | After | After + delay |
| -------- |:-----:|:------:|:-----:|:-------------:|
|/service-discovery size|5K|85MiB|27k|27k|27k|
|`go_memstats_heap_objects`|100k|1M|120k|110k|
|`go_memstats_heap_alloc_bytes`|24MB|150MB|28MB|27MB|
|`rate(go_memstats_alloc_bytes_total[5m])`|0.2MB/s|28MB/s|2MB/s|0.3MB/s|
|`rate(process_cpu_seconds_total[5m])`|0.1%|15%|2%|0.01%|
|`process_open_fds`|16|*1236*|22|22|
|`rate(prometheus_sd_consul_rpc_duration_seconds_count{call="services"}[5m])`|~0|1|1|*0.03*|
|`rate(prometheus_sd_consul_rpc_duration_seconds_count{call="service"}[5m])`|0.1|*80*|0.5|0.5|
|`prometheus_target_sync_length_seconds{quantile="0.9",scrape_job="observability-by-tag"}`|N/A|200ms|0.2ms|0.2ms|
|Network bandwidth|~10kbps|~2.8Mbps|~1.6Mbps|~10kbps|
Filtering by tag using relabel_configs uses **100kiB and 23kiB/s per service per job** and quite a lot of CPU. Also sends and additional *1Mbps* of traffic to consul.
Being a little bit smarter about this reduces the overhead quite a lot.
Limiting the number of `/catalog/services` queries per second almost removes the overhead of service discovery.
* consul: tweak `refresh_interval` behavior
`refresh_interval` now does what is advertised in the documentation,
there won't be more that one update per `refresh_interval`. It now
defaults to 30s (which was also the current waitTime in the consul query).
This also make sure we don't wait another 30s if we already waited 29s
in the blocking call by substracting the number of elapsed seconds.
Hopefully this will do what people expect it does and will be safer
for existing consul infrastructures.