Switched the test to use static_configs rather than the dns_sd_configs
parameter. Verified that the revised test both passes without network
access and still catches the bug it is supposed to cover.
Verify that when the configs change, target groups are cleaned up on
TargetManager.reload, rather than having old ones linger around even
though they are no longer present in the configs.
This covers the bug fixed in #1907 -- I verified that by checking out
the source from before that commit.
This is a start on #1906.
Also, remove unused `providers` field in targetSet.
If the config file changes, we recreate all providers (by calling
`providersFromConfig`) and retrieve all targets anew from the newly
created providers. From that perspective, it cannot hurt to clean up
the target group map in the targetSet. Not doing so (as was the case
so far) keeps stale targets around. This mattered if an existing
key in the target group map was not overwritten in the initial fetch
of all targets from the providers. Examples where that mattered:
```
scrape_configs:
- job_name: "foo"
  static_configs:
  - targets: ["foo:9090"]
  - targets: ["bar:9090"]
```
updated to:
```
scrape_configs:
- job_name: "foo"
  static_configs:
  - targets: ["foo:9090"]
```
`bar:9090` would still be monitored. (The static provider just
enumerates the target groups. If the number of target groups
decreases, the old ones stay around.)
```
scrape_configs:
- job_name: "foo"
  dns_sd_configs:
  - names:
    - "srv.name.one.example.org"
```
updated to:
```
scrape_configs:
- job_name: "foo"
  dns_sd_configs:
  - names:
    - "srv.name.two.example.org"
```
Now both SRV records are still monitored. The SRV name is part of the
key in the target group map, so the new one is just added and the
old one stays around.
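To make the fix concrete, here is a minimal, self-contained sketch of
the cleanup (type and field names like targetSet/tgroups are
illustrative, not the exact code):
```go
package main

import (
	"fmt"
	"sync"
)

// Illustrative stand-ins for the internal types.
type Target struct{ addr string }

type targetSet struct {
	mtx     sync.Mutex
	tgroups map[string][]*Target // keyed by provider/group source
}

// reload mimics the fix: wipe the group map so that keys absent from
// the new config (an old SRV name, a removed static group) cannot linger.
func (ts *targetSet) reload() {
	ts.mtx.Lock()
	defer ts.mtx.Unlock()
	ts.tgroups = map[string][]*Target{}
}

func main() {
	ts := &targetSet{tgroups: map[string][]*Target{
		"static/0/1": {{addr: "bar:9090"}}, // stale after the config update
	}}
	ts.reload()
	fmt.Println(len(ts.tgroups)) // 0 -- stale groups are gone
}
```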
Obviously, this should have had tests before, and not only for this
case. This is the quick fix. I have created
https://github.com/prometheus/prometheus/issues/1906 to track test
creation.
Fixes https://github.com/prometheus/prometheus/issues/1610 .
0 is considered an invalid interval by time.NewTicker() and will cause a
panic if control reaches that point. Given the vagaries of timekeeping,
this may occasionally happen and make this test unstable.
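For illustration, a guard of roughly this shape avoids the panic (the
1ns floor is an assumed value, not necessarily what the code uses):
```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// time.NewTicker panics on a non-positive duration, so clamp the
	// computed interval before handing it over.
	interval := time.Duration(0) // e.g. the result of a racy time computation
	if interval <= 0 {
		interval = 1 * time.Nanosecond
	}
	t := time.NewTicker(interval)
	defer t.Stop()
	fmt.Println("ticker created with interval", interval)
}
```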
Apart from not trying to send a newline in an HTTP header,
this also allows Prometheus to build and pass tests with Go 1.7,
which features stricter checking of HTTP headers.
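To illustrate the failure mode (the header name here is hypothetical):
```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	req, err := http.NewRequest("GET", "http://example.com/metrics", nil)
	if err != nil {
		panic(err)
	}

	// fmt.Sprintln appends "\n"; Go 1.7's net/http refuses to send a
	// request whose header value contains such control characters.
	bad := fmt.Sprintln(30.0)       // "30\n" -- fails Go 1.7's validation
	good := fmt.Sprintf("%g", 30.0) // "30"   -- passes

	// The header name is hypothetical, for illustration only.
	req.Header.Set("X-Scrape-Timeout-Seconds", good)
	_ = bad
}
```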
This adds a `role` field to the Kubernetes SD config, which indicates
which type of Kubernetes SD should be run.
For example, it is no longer possible to discover pods and nodes with
the same SD configuration.
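A rough sketch of the resulting config shape; the type and role names
are assumptions based on the description above, not the exact code:
```go
package kubernetes

// Sketch only: names are assumptions, not the actual Prometheus code.
type Role string

const (
	RoleNode Role = "node"
	RolePod  Role = "pod"
)

type SDConfig struct {
	// Role selects exactly one kind of Kubernetes object to discover,
	// so a single SD configuration can no longer yield both pods and nodes.
	Role Role `yaml:"role"`
	// ... connection options elided ...
}
```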
This change just signals a scrape target update to the scraping loop
once an initial target set is fetched.
Before, the scrape pool was synced directly, causing a race against an
uninitialized scrape pool.
Fixes #1703
The concurrency applied before is in most cases not even needed. With
a cap=1 channel, most tests are much cleaner.
TestMarathonSDRunAndStop was trickier. It could even have blocked
before.
This also includes a general refactoring of the whole file.
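The pattern in question, roughly -- a cap=1 channel lets the code under
test deliver its single expected update without a concurrent reader
(names are illustrative):
```go
package main

import "fmt"

type targetGroup struct{ source string }

func main() {
	// With cap=1 the provider can deliver its one expected update
	// without a goroutine concurrently draining the channel, so the
	// test body stays sequential and cannot block on the send.
	ch := make(chan targetGroup, 1)
	ch <- targetGroup{source: "marathon"} // would block forever if unbuffered
	tg := <-ch
	fmt.Println("received update from", tg.source)
}
```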
This change deprecates the `target_groups` option in favor
of `static_configs`. The old configuration is still accepted
but prints a warning.
Configuration loading errors if both options are set.
From the documentation and current tests, it wasn't immediately clear to
me whether the `target` being dropped as the result of a 'drop' action
was a label key-value pair or the entire labelset.
Add a test that documents this behaviour.
Documentation: https://prometheus.io/docs/operating/configuration/
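For reference, the semantics the test pins down, as a self-contained
sketch -- a 'drop' action discards the entire labelset, not just one
pair; applyRelabel is a hypothetical stand-in, not the actual API:
```go
package main

import "fmt"

// applyRelabel is a hypothetical stand-in for the relabelling code.
// It returns nil when a 'drop' rule matches, discarding the whole target.
func applyRelabel(labels map[string]string, dropIfJob string) map[string]string {
	if labels["job"] == dropIfJob {
		return nil // the entire labelset is dropped, not just the "job" pair
	}
	return labels
}

func main() {
	target := map[string]string{"job": "foo", "instance": "foo:9090"}
	fmt.Println(applyRelabel(target, "foo")) // map[] (nil): target is gone entirely
}
```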
So far, out-of-order samples during rule evaluation were not logged,
and neither were scrape health samples. The latter are unlikely to
cause any errors, which is why I'm now logging them always. (It's
always highly irregular should it happen.) For rules, I have used the
same plumbing as for samples, just with a different wording in the
message to mark them as a result of rule evaluation, but only on DEBUG
level.
Also, count and report separately the two cases of out-of-order
timestamps on the one hand, and same timestamp but different value on
the other.
Prometheus is Apache 2 licensed, and most source files have the
appropriate copyright license header, but some were missing it without
apparent reason. Correct that by adding it.
This fixes the case where a target provider closes the update
channel and exits before the context is canceled.
This should only be true for the static provider but it's safer
to generally handle this case.
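The general Go pattern for handling this -- detect the close via the
comma-ok receive instead of spinning on zero values (a sketch, not the
exact code):
```go
package main

import (
	"context"
	"fmt"
)

type targetGroup struct{ source string }

func run(ctx context.Context, updates <-chan targetGroup) {
	for {
		select {
		case tg, ok := <-updates:
			if !ok {
				// The provider closed its channel and exited; without
				// this check the loop would spin on zero values.
				return
			}
			fmt.Println("update:", tg.source)
		case <-ctx.Done():
			return
		}
	}
}

func main() {
	ch := make(chan targetGroup, 1)
	ch <- targetGroup{source: "static/0"}
	close(ch) // e.g. the static provider sends once and exits
	run(context.Background(), ch)
}
```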
This commit simplifies the TargetHealth type and moves the target
status into the target itself. This also removes a race where error
and last scrape time could have been out of sync.
This commit removes the scrapeConfig entirely from Target.
All identity-defining parameters are thus immutable now and the mutex
can be removed.
Target identity is now correctly defined by the labels and the full URL.
This in particular includes URL parameters that are not specified in the
label set.
Fingerprint is also removed from the hash to remove an unnecessarily
tight coupling to the common/model package.
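A sketch of what such an identity hash can look like -- FNV over the
sorted label pairs plus the full URL, with no common/model dependency
(illustrative, not the exact implementation):
```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// hashTarget derives a target identity from its labels and full URL,
// including URL parameters that never appear in the label set.
func hashTarget(labels map[string]string, fullURL string) uint64 {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys) // stable order so equal labelsets hash equally

	h := fnv.New64a()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write([]byte(labels[k]))
	}
	h.Write([]byte(fullURL))
	return h.Sum64()
}

func main() {
	a := hashTarget(map[string]string{"job": "foo"}, "http://x:9090/metrics?module=icmp")
	b := hashTarget(map[string]string{"job": "foo"}, "http://x:9090/metrics?module=http")
	fmt.Println(a == b) // false: differing URL params mean different targets
}
```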
With this commit the scrape pool deduplicates incoming
targets before scraping them. This way multiple target providers
can produce the same target but it will be scraped only once.
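Deduplication then falls out of keying by the identity hash (a fragment
reusing the hashTarget helper from the sketch above):
```go
// Two providers deliver the same target; keying by hash scrapes it once.
unique := map[uint64]string{}
for _, u := range []string{"http://x:9090/metrics", "http://x:9090/metrics"} {
	unique[hashTarget(map[string]string{"job": "foo"}, u)] = u
}
fmt.Println(len(unique)) // 1
```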
This commit updates a target set's scrape configuration
on reload. This will cause all running scrape loops to be
stopped and started again with new parameters.
This commit changes the scraper interface to accept a timestamp
so the timestamp reported by the caller and the timestamp attached
to samples do not differ.
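Schematically, the interface change amounts to this (names are
approximations, not the exact code):
```go
package retrieval

import (
	"context"
	"time"
)

// Sketch of the revised interface: the caller passes the timestamp it
// will report, so it cannot drift from the one attached to the samples.
type scraper interface {
	scrape(ctx context.Context, ts time.Time) error
}
```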
This commit moves Scraper handling into a separate scrapePool type.
TargetSets only manage TargetProvider lifecycles and sync the
retrieved updates to the scrapePool.
TargetProviders are now expected to send a full initial target set
within 5 seconds. The scrapePools preserve target state across reloads
and only drop targets after the initial set was synced.
We group providers by their scrape configuration. Each provider
produces target groups with a unique identifier.
On stopping a set of target providers we cancel the target providers,
stop scraping the targets and wait for the scrapers to finish.
On configuration reload all provider sets are stopped and new ones
are created. This will make targets disappear briefly on configuration
reload. Scrapes are potentially missed, but due to the consistent
scrape intervals implemented recently, the impact is minor.
Double acquisition of the RLock usually doesn't blow up, but if the
write lock is requested between the two RLocks, we are deadlocked.
This deadlock does not exist in release-0.17, BTW.
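Concretely, sync.RWMutex read locks don't nest safely once a writer is
waiting; this minimal reproduction (not the actual code) trips Go's
runtime deadlock detector:
```go
package main

import (
	"sync"
	"time"
)

var mu sync.RWMutex

func nested() {
	mu.RLock() // 1st read lock
	defer mu.RUnlock()

	time.Sleep(100 * time.Millisecond) // window for the writer to queue up

	mu.RLock() // 2nd read lock blocks behind the waiting writer...
	defer mu.RUnlock()
}

func main() {
	go nested()
	time.Sleep(50 * time.Millisecond)
	mu.Lock() // ...while the writer waits for the 1st RUnlock: deadlock
	mu.Unlock()
}
```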
With recent changes to a Target's internal data representation,
updating via fullLabels() assigns the additional default instance
label. This breaks target identity comparison and causes identical
targets from service discovery to be constantly swapped.
So far we were using the InstanceIdentifier to compare equality of targets.
This is not always accurate, for example for the blackbox exporter where the
actual target is in the parameter.
For historic reasons we were enforcing a timeout directly via the
TCP dialer. This has not been necessary for quite a while now.
Switching to context.Context will allow us to properly terminate
requests on shutdown as well.
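A sketch of the context-based shape (Go 1.7+ request contexts; the URL
and timeout are illustrative):
```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Instead of baking a timeout into the TCP dialer, attach a context
	// to the request; cancelling it (e.g. on shutdown) aborts the scrape.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	req, err := http.NewRequest("GET", "http://localhost:9090/metrics", nil)
	if err != nil {
		panic(err)
	}
	resp, err := http.DefaultClient.Do(req.WithContext(ctx))
	if err != nil {
		fmt.Println("scrape aborted:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```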
To evenly distribute scraping load we currently rely on random
jittering. This commit hashes over the target's identity and calculates
a consistent offset. This also ensures that scrape intervals remain
consistently spaced across config/target changes.
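The idea, sketched -- reduce the identity hash modulo the interval so
each target gets a stable slot within every interval window (names are
illustrative):
```go
package main

import (
	"fmt"
	"time"
)

// offset returns a deterministic delay in [0, interval) derived from the
// target's identity hash, so the same target always lands on the same
// phase of the scrape interval instead of getting a random jitter.
func offset(hash uint64, interval time.Duration) time.Duration {
	return time.Duration(hash % uint64(interval))
}

func main() {
	interval := 15 * time.Second
	for _, h := range []uint64{0xdeadbeef, 0xcafebabe} {
		fmt.Printf("target %x scrapes at +%v into each interval\n", h, offset(h, interval))
	}
}
```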
This gives up on the idea to communicate through the Append() call (by
either not returning as it is now or returning an error as
suggested/explored elsewhere). Here I have added a Throttled() call,
which has the advantage that it can be called before a whole _batch_
of Append()'s. Scrapes will happen completely or not at all. Same for
rule group evaluations. That's a highly desired behavior (as discussed
elsewhere). The code is even simpler now as the whole ingestion buffer
could be removed.
Logging of throttled mode has been streamlined and will create at most
one message per minute.
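The enabled call pattern, schematically (the interface shape here is an
assumption based on the description above, not the actual API):
```go
package main

import "fmt"

// Sketch of the pattern: one Throttled() check gates a whole batch of
// Append() calls, so a scrape happens completely or not at all.
type appender interface {
	Throttled() bool
	Append(v float64)
}

type noopAppender struct{ throttled bool }

func (a noopAppender) Throttled() bool { return a.throttled }
func (a noopAppender) Append(float64)  {}

func ingest(app appender, samples []float64) bool {
	if app.Throttled() {
		return false // drop the complete scrape, never a partial one
	}
	for _, s := range samples {
		app.Append(s)
	}
	return true
}

func main() {
	fmt.Println(ingest(noopAppender{throttled: true}, []float64{1, 2, 3})) // false
}
```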
Duration parsing actually happens in several places (and for flags,
we use the standard Go time.Duration...). This at least reduces all
our home-grown parsing to one place (in model).
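Assuming the consolidated parser lives in common/model as described,
usage looks like:
```go
package main

import (
	"fmt"

	"github.com/prometheus/common/model"
)

func main() {
	// One shared parser for Prometheus-style duration strings,
	// instead of ad-hoc parsing scattered across the codebase.
	d, err := model.ParseDuration("15s")
	if err != nil {
		panic(err)
	}
	fmt.Println(d) // 15s
}
```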
nerve's registration format differs from serverset's. With this commit
there is now a dedicated treecache file in util,
and two separate files for serverset and nerve.
Reference:
https://github.com/airbnb/nerve
For the SNMP and blackbox exporters, where the port tends not to be
80/443 and indeed there may not be a port at all, this makes the
relabelling a bit simpler, as you don't have to figure out that this
logic exists and strip off the :80.
This is a breaking change for the example configs of those exporters.
With the blackbox exporter, the instance label will commonly be used
for things other than hostnames, so remove this restriction.
https://example.com or https://example.com/probe/me are some examples.
To prevent user error, check that URLs aren't provided as targets
when there's no relabelling that could potentially fix them.
1. Static credentials replaced with defaults.DefaultChainCredentials
(see the sketch after this list). This change ensures that credentials
are sourced from all possible providers available with the AWS SDK, in
the following order: env variables, the shared AWS config file in the
user folder, and the EC2 instance role.
2. Added a few labels: AvailabilityZone, PublicDns, VpcId (if
available), SubnetId (if in Vpc)
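Schematically, with the AWS SDK of that era (the region and function
name are illustrative):
```go
package main

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/defaults"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func newEC2Client() *ec2.EC2 {
	// DefaultChainCredentials resolves credentials in order: environment
	// variables, the shared credentials file in the user's home directory,
	// then the EC2 instance role.
	sess := session.New(&aws.Config{
		Region:      aws.String("eu-west-1"),
		Credentials: defaults.DefaultChainCredentials,
	})
	return ec2.New(sess)
}
```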
The prefixed target provider changed a pointerized target group that was
reused in the wrapped target provider, causing an ever-increasing chain
of source prefixes in target groups from the Consul target provider.
We now make this bug generally impossible by switching the target group
channel from pointer to value type and thus ensuring that target groups
are copied before being passed on to other parts of the system.
I tried not to let the depointerization leak too far outside of the
channel handling (both upstream and downstream); I tried that initially
and it caused some nasty bugs, which I want to avoid.
Fixes https://github.com/prometheus/prometheus/issues/1083
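The class of bug, boiled down: sending a reused pointer lets later
mutations reach back into messages already "sent", while a value
channel copies on send. A minimal reproduction:
```go
package main

import "fmt"

type targetGroup struct{ Source string }

func main() {
	// Pointer channel: the provider reuses one struct, so a prefix
	// applied later also rewrites what was already "sent".
	pch := make(chan *targetGroup, 2)
	tg := &targetGroup{Source: "consul"}
	pch <- tg
	tg.Source = "prefix/" + tg.Source // mutates the group already in flight
	pch <- tg
	fmt.Println((<-pch).Source, (<-pch).Source) // prefix/consul prefix/consul

	// Value channel: each send copies the group, so downstream prefixing
	// can never reach back into earlier messages.
	vch := make(chan targetGroup, 2)
	vg := targetGroup{Source: "consul"}
	vch <- vg
	vg.Source = "prefix/" + vg.Source
	vch <- vg
	fmt.Println((<-vch).Source, (<-vch).Source) // consul prefix/consul
}
```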