The automatic GZIP handling of net/http does not preserve
buffers across requests and thus generates a lot of garbage.
We handle GZIP ourselves to circumvent this.
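A rough sketch of the manual handling, reusing one gzip.Reader per scraper via Reset; the scraper type and field names are illustrative, not the actual implementation:
```
package scrape

import (
	"bufio"
	"compress/gzip"
	"io"
	"net/http"
)

// scraper keeps its readers across scrapes so gzip state is reused
// instead of being reallocated for every request.
type scraper struct {
	client *http.Client
	gzipr  *gzip.Reader
	buf    *bufio.Reader
}

func (s *scraper) scrape(url string, w io.Writer) error {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return err
	}
	// Ask for gzip explicitly; setting the header ourselves disables
	// net/http's automatic decompression, so we control the buffers.
	req.Header.Set("Accept-Encoding", "gzip")

	resp, err := s.client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	if resp.Header.Get("Content-Encoding") != "gzip" {
		_, err = io.Copy(w, resp.Body)
		return err
	}

	// Reuse the buffered reader and gzip.Reader across scrapes.
	if s.gzipr == nil {
		s.buf = bufio.NewReader(resp.Body)
		s.gzipr, err = gzip.NewReader(s.buf)
	} else {
		s.buf.Reset(resp.Body)
		err = s.gzipr.Reset(s.buf)
	}
	if err != nil {
		return err
	}
	_, err = io.Copy(w, s.gzipr)
	return err
}
```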
With this change, scraping caches series references and only
allocates label sets if it has to retrieve a new reference.
pkg/textparse is used to do the conditional parsing and reduce
allocations from 900B/sample to 0 in the general case.
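A rough sketch of the caching idea with illustrative types (not the real scrape loop or storage interfaces): the raw metric bytes act as the cache key, and a label set is only allocated on a cache miss:
```
package scrape

// Appender stands in for the storage appender; Add returns a reference
// that AddFast can use to skip label handling for subsequent samples.
type Appender interface {
	Add(labels map[string]string, t int64, v float64) (ref uint64, err error)
	AddFast(ref uint64, t int64, v float64) error
}

// refCache maps the raw metric representation to a storage reference.
type refCache struct {
	refs map[string]uint64
}

// add uses the cached reference if one exists; lsetFn is only called
// (and a label set only allocated) when there is no valid reference yet.
func (c *refCache) add(app Appender, met []byte, lsetFn func() map[string]string, t int64, v float64) error {
	if ref, ok := c.refs[string(met)]; ok {
		// Fast path: no label set is allocated at all.
		if err := app.AddFast(ref, t, v); err == nil {
			return nil
		}
		// Reference became invalid; fall through to the slow path.
	}
	// Slow path: allocate the label set and cache the new reference.
	ref, err := app.Add(lsetFn(), t, v)
	if err != nil {
		return err
	}
	c.refs[string(met)] = ref
	return nil
}
```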
These lines exercise an append in
TestScrapeLoopWrapSampleAppender. Arguably, append shouldn't be tested
there in the first place.
Still, it's unclear why this fails on Travis:
```
--- FAIL: TestScrapeLoopWrapSampleAppender (0.00s)
scrape_test.go:259: Expected count of 1, got 0
scrape_test.go:290: Expected count of 1, got 0
2017/01/07 22:48:26 http: TLS handshake error from 127.0.0.1:50716: read tcp 127.0.0.1:40265->127.0.0.1:50716: read: connection reset by peer
FAIL
FAIL github.com/prometheus/prometheus/retrieval 3.603s
```
Should anybody ever find out why, please revert this commit accordingly.
retrieval.Target contains a mutex that was copied in the Targets()
call. Copying a mutex like this can wreak a lot of havoc.
It might even have caused the issues reported as #2266 and #2262.
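A minimal illustration of the problem with a hypothetical Target type; returning pointers avoids copying the embedded mutex, and `go vet`'s copylocks check flags the by-value variant:
```
package retrieval

import "sync"

// Target embeds a mutex protecting its mutable state.
type Target struct {
	mtx    sync.RWMutex
	labels map[string]string
}

type pool struct {
	mtx     sync.RWMutex
	targets []*Target
}

// Targets returns pointers to the shared Target values. Returning []Target
// instead would copy each struct, and with it the mutex, so a lock taken
// on the copy would no longer guard the original's state.
func (p *pool) Targets() []*Target {
	p.mtx.RLock()
	defer p.mtx.RUnlock()
	out := make([]*Target, len(p.targets))
	copy(out, p.targets)
	return out
}
```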
This imposes a hard limit on the number of samples ingested from the
target. This is counted after metric relabelling, to allow dropping of
problematic metrics.
This is intended as a very blunt tool to prevent overload due to
misbehaving targets that suddenly jump in sample count (e.g. adding
a label containing email addresses).
Add a metric to track how often this happens.
Fixes #2137
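A hedged sketch of how such a limit might be enforced and counted; the helper and the registration are illustrative, not the actual scrape code:
```
package retrieval

import (
	"errors"

	"github.com/prometheus/client_golang/prometheus"
)

var errSampleLimit = errors.New("sample limit exceeded")

// targetScrapeSampleLimit counts how often a scrape was rejected because
// the target exposed more samples than the configured limit.
var targetScrapeSampleLimit = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "prometheus_target_scrapes_exceeded_sample_limit_total",
	Help: "Total number of scrapes that hit the sample limit and were rejected.",
})

func init() {
	prometheus.MustRegister(targetScrapeSampleLimit)
}

// checkSampleLimit is called once per sample that survived metric
// relabelling. It returns an error as soon as the limit is crossed so
// the whole scrape can be failed.
func checkSampleLimit(added, limit int) error {
	if limit > 0 && added > limit {
		targetScrapeSampleLimit.Inc()
		return errSampleLimit
	}
	return nil
}
```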
This also removes the closing of the target group channel everywhere,
as the context is cancelled across all stages and we don't care about
draining remaining events once that has happened.
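A small sketch of the pattern with illustrative names: the provider stops sending when the context is cancelled and the consumer selects on ctx.Done() too, so nobody has to close the updates channel:
```
package discovery

import "context"

type targetGroup struct {
	source  string
	targets []string
}

// run sends updates until the context is cancelled. The channel is never
// closed; cancellation is the only shutdown signal.
func run(ctx context.Context, updates chan<- []*targetGroup, next func() []*targetGroup) {
	for {
		tgs := next()
		select {
		case updates <- tgs:
		case <-ctx.Done():
			return
		}
	}
}

// consume drains updates until the same context is cancelled, so it does
// not depend on the channel ever being closed.
func consume(ctx context.Context, updates <-chan []*targetGroup, sync func([]*targetGroup)) {
	for {
		select {
		case tgs := <-updates:
			sync(tgs)
		case <-ctx.Done():
			return
		}
	}
}
```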
* retrieval/discovery/kubernetes: fix cache state unknown behavior
* retrieval/discovery/kubernetes: extract type casting
* retrieval/discovery/kubernetes: add tests for possible regressions
* Add config, HTTP Basic Auth and TLS support to the generic write path.
- Move generic write path configuration to the config file
- Factor out config.TLSConfig -> tls.Config translation (see the sketch after this list)
- Support TLSConfig for generic remote storage
- Rename Run to Start, and make it non-blocking.
- Dedupe code in httputil for TLS config.
- Make remote queue metrics global.
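A hedged sketch of the kind of translation meant above, turning file-based TLS options into a crypto/tls.Config; the config struct shown here is illustrative and may not match the real config.TLSConfig:
```
package httputil

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"io/ioutil"
)

// TLSConfig mirrors the file-based TLS options from the configuration.
type TLSConfig struct {
	CAFile             string
	CertFile           string
	KeyFile            string
	InsecureSkipVerify bool
}

// NewTLSConfig translates the configuration struct into a *tls.Config
// usable by an HTTP client.
func NewTLSConfig(cfg TLSConfig) (*tls.Config, error) {
	tlsCfg := &tls.Config{InsecureSkipVerify: cfg.InsecureSkipVerify}

	if cfg.CAFile != "" {
		caPEM, err := ioutil.ReadFile(cfg.CAFile)
		if err != nil {
			return nil, fmt.Errorf("unable to read CA cert %s: %s", cfg.CAFile, err)
		}
		pool := x509.NewCertPool()
		if !pool.AppendCertsFromPEM(caPEM) {
			return nil, fmt.Errorf("unable to use CA cert %s", cfg.CAFile)
		}
		tlsCfg.RootCAs = pool
	}

	if cfg.CertFile != "" && cfg.KeyFile != "" {
		cert, err := tls.LoadX509KeyPair(cfg.CertFile, cfg.KeyFile)
		if err != nil {
			return nil, fmt.Errorf("unable to load client cert/key: %s", err)
		}
		tlsCfg.Certificates = []tls.Certificate{cert}
	}

	return tlsCfg, nil
}
```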
Switched to testing by way of the static_configs rather than
dns_sd_config parameter. Verified that the revised test both passes
without network access, and also still catches the bug it's supposed to
cover.
Verify that if the configs change, target groups are cleaned on
TargetManager.reload (rather than having old ones linger around, even if
they are no longer present in the configs).
This covers the bug fixed in #1907 -- I verified that by checking out
source from before that commit.
This is a start on #1906
Also, remove unused `providers` field in targetSet.
If the config file changes, we recreate all providers (by calling
`providersFromConfig`) and retrieve all targets anew from the newly
created providers. From that perspective, it cannot harm to clean up
the target group map in the targetSet. Not doing so (as was the
case so far) keeps stale targets around. This mattered if an existing
key in the target group map was not overwritten in the initial fetch
of all targets from the providers. Examples where that mattered:
```
scrape_configs:
- job_name: "foo"
  static_configs:
  - targets: ["foo:9090"]
  - targets: ["bar:9090"]
```
updated to:
```
scrape_configs:
- job_name: "foo"
  static_configs:
  - targets: ["foo:9090"]
```
`bar:9090` would still be monitored. (The static provider just
enumerates the target groups. If the number of target groups
decreases, the old ones stay around.)
```
scrape_configs:
- job_name: "foo"
  dns_sd_configs:
  - names:
    - "srv.name.one.example.org"
```
updated to:
```
scrape_configs:
- job_name: "foo"
  dns_sd_configs:
  - names:
    - "srv.name.two.example.org"
```
Now both SRV records are still monitored. The SRV name is part of the
key in the target group map, thus the new one is just added and the
old one stays around.
Obviously, this should have tests, and should have had tests before,
not only for this case. This is the quick fix. I have created
https://github.com/prometheus/prometheus/issues/1906 to track test
creation.
Fixes https://github.com/prometheus/prometheus/issues/1610.
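A minimal sketch of the shape of the fix, with illustrative types: on reload, the target group map is wiped before the freshly created providers repopulate it, so no stale keys survive.
```
package retrieval

type targetGroup struct {
	source  string
	targets []string
}

type targetSet struct {
	tgroups map[string]*targetGroup
}

// reload drops all previously known target groups before the new providers
// deliver their initial, complete set of groups. Without the wipe, a key
// that the new providers never write again (e.g. a removed SRV name or a
// dropped static group) would keep its stale targets around.
func (ts *targetSet) reload(initial []*targetGroup) {
	ts.tgroups = map[string]*targetGroup{}
	for _, tg := range initial {
		ts.tgroups[tg.source] = tg
	}
}
```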
0 is considered an invalid interval by time.NewTicker() and will cause a
panic if control reaches that point. Given the vagaries of timekeeping,
this may occasionally happen and make this test unstable.
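A hedged sketch of the guard implied here; the helper is illustrative:
```
package retrieval

import "time"

// waitInterval waits for the remaining interval but never hands a
// non-positive duration to time.NewTicker, which panics on values <= 0.
func waitInterval(remaining time.Duration) {
	if remaining <= 0 {
		// Clock skew or scheduling delays already consumed the interval;
		// skip the ticker entirely instead of panicking.
		return
	}
	ticker := time.NewTicker(remaining)
	defer ticker.Stop()
	<-ticker.C
}
```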
Apart from not trying to send a newline in an HTTP header,
this also allows Prometheus to build and pass tests with Go 1.7,
which features stricter checking of HTTP headers.
This adds a `role` field to the Kubernetes SD config, which indicates
which type of Kubernetes SD should be run.
For example, it is no longer possible to discover pods and nodes with
the same SD configuration.
This change just signals a scrape target update to the scraping loop
once an initial target set is fetched.
Before, the scrape pool was synced directly, causing a race against an
uninitialized scrape pool.
Fixes #1703
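A small sketch of the signaling pattern with illustrative names: discovery does a non-blocking send on a capacity-1 channel instead of syncing the scrape pool directly, and the scraping loop syncs only once it is ready:
```
package retrieval

import "sync"

type targetSet struct {
	mtx     sync.Mutex
	tgroups map[string][]string
	syncCh  chan struct{}
}

func newTargetSet() *targetSet {
	return &targetSet{
		tgroups: map[string][]string{},
		// Capacity 1: a pending signal is remembered, but signals never
		// pile up and the sender never blocks.
		syncCh: make(chan struct{}, 1),
	}
}

// update records new target groups and signals the scraping loop instead
// of syncing the scrape pool directly.
func (ts *targetSet) update(source string, targets []string) {
	ts.mtx.Lock()
	ts.tgroups[source] = targets
	ts.mtx.Unlock()

	select {
	case ts.syncCh <- struct{}{}:
	default: // A sync is already pending.
	}
}

// run is the scraping side: it only syncs after being signaled, so it
// never races against a scrape pool that has not been initialized yet.
func (ts *targetSet) run(stop <-chan struct{}, sync func(map[string][]string)) {
	for {
		select {
		case <-ts.syncCh:
			ts.mtx.Lock()
			snapshot := make(map[string][]string, len(ts.tgroups))
			for k, v := range ts.tgroups {
				snapshot[k] = v
			}
			ts.mtx.Unlock()
			sync(snapshot)
		case <-stop:
			return
		}
	}
}
```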
The concurrency applied before is in most cases not even needed. With
a cap=1 channel, most tests are much cleaner.
TestMarathonSDRunAndStop was trickier. It could even have blocked
before.
This also includes a general refactoring of the whole file.
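A hedged sketch of the test pattern meant here, with an illustrative discovery stub; the cap=1 channel lets the goroutine deliver its update without blocking even if the test is not receiving yet:
```
package discovery

import (
	"testing"
	"time"
)

type group struct{ targets []string }

// runOnce stands in for a discovery Run method that sends one update and
// returns when done is closed.
func runOnce(ch chan<- *group, done <-chan struct{}) {
	select {
	case ch <- &group{targets: []string{"foo:9090"}}:
	case <-done:
	}
}

func TestRunAndStop(t *testing.T) {
	// cap=1 lets runOnce deliver its update even before the test starts
	// receiving, so the goroutine cannot block forever.
	ch := make(chan *group, 1)
	done := make(chan struct{})
	defer close(done)

	go runOnce(ch, done)

	select {
	case tg := <-ch:
		if len(tg.targets) != 1 {
			t.Fatalf("unexpected targets: %v", tg.targets)
		}
	case <-time.After(time.Second):
		t.Fatal("timed out waiting for target group update")
	}
}
```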
This change deprecates the `target_groups` option in favor
of `static_configs`. The old configuration is still accepted
but prints a warning.
Configuration loading errors if both options are set.
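A rough sketch of how such a deprecation might be handled during config loading; the struct and field names are illustrative, not the real config package:
```
package config

import (
	"fmt"
	"log"
)

// scrapeConfig shows only the two competing options.
type scrapeConfig struct {
	StaticConfigs []string `yaml:"static_configs,omitempty"`
	TargetGroups  []string `yaml:"target_groups,omitempty"` // deprecated
}

// resolveStaticConfigs accepts the deprecated option with a warning and
// fails loading if both options are set.
func (c *scrapeConfig) resolveStaticConfigs() error {
	if len(c.TargetGroups) > 0 {
		if len(c.StaticConfigs) > 0 {
			return fmt.Errorf("'target_groups' is deprecated, use 'static_configs' instead; only one of them may be set")
		}
		log.Println("warning: 'target_groups' is deprecated, use 'static_configs' instead")
		c.StaticConfigs = c.TargetGroups
		c.TargetGroups = nil
	}
	return nil
}
```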
From the documentation and current tests, it wasn't immediately clear to
me whether the `target` being dropped as the result of a 'drop' action
was a label key-value pair or the entire labelset.
Add a test that documents this behaviour.
Documentation: https://prometheus.io/docs/operating/configuration/
So far, out-of-order samples during rule evaluation were not logged,
and neither were out-of-order scrape health samples. The latter are
unlikely to cause any errors, which is why I'm now always logging them
(it's always highly irregular should it happen). For rules, I have used
the same plumbing as for regular samples, just with a different wording
in the message to mark them as a result of rule evaluation.
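A hedged sketch of the logging plumbing described, with illustrative error and logger names rather than the real storage and log packages:
```
package rules

import (
	"errors"
	"log"
)

// errOutOfOrder stands in for the storage error returned when a sample is
// older than what is already stored for the series.
var errOutOfOrder = errors.New("sample out of order")

// appendSample tries to append and logs out-of-order samples instead of
// dropping them silently. The wording marks rule results explicitly.
func appendSample(append func() error, metric string, fromRules bool) {
	err := append()
	if err == nil {
		return
	}
	if errors.Is(err, errOutOfOrder) {
		if fromRules {
			log.Printf("warning: out-of-order result from rule evaluation, metric=%s", metric)
		} else {
			log.Printf("warning: out-of-order sample, metric=%s", metric)
		}
		return
	}
	log.Printf("error appending sample for metric=%s: %v", metric, err)
}
```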