Once a scrape target has reached the samle_limit, any further samples
from that target will continue to result in errSampleLimit.
This means that after the completion of the append loop `err` must still
be errSampleLimit, and so the metric would not have been incremented.
We were not properly maintaining the scrape cache when the same metric
was exposed with a different string representation.
This overall reduces the scraping cache's complexity, which fixes the
issue and saves about 10% of memory in a scraping-only Prometheus
instance.
This adds a bucketed buffer pool to the scrapers so we don't have to
allocate a new buffer on each scrape or hold it fixed to the scrape
loop.
The latter can consume significant amounts of unused memory, e.g. 4GB
when scraping 2MB /metrics from 2000 targets.
When a target is no longer returned from SD stop()
is called. However it may be recreated before the
next scrape interval happens. So we wait to set stalemarkers
until the scrape of the new target would have happened
and been ingested, which is 2 scrape intervals.
If we're shutting down the context will be cancelled,
so return immediately rather than holding things up for potentially
minutes waiting to safely set stalemarkers no newer than now.
If the server starts immediately back up again all is well.
If not, we're missing some stale markers.
In Prometheus 1.x one sample that is out of order
or that has a duplicate timestamp is discarded, and
the rest of the scrape ingestion continues on.
This will now also be true for 2.0.
With this change the scraping caches series references and only
allocates label sets if it has to retrieve a new reference.
pkg/textparse is used to do the conditional parsing and reduce
allocations from 900B/sample to 0 in the general case.
These lines exercise an append in
TestScrapeLoopWrapSampleAppender. Arguably, append shouldn't be tested
there in the first place.
Still it's weird why this fails on Travis:
```
--- FAIL: TestScrapeLoopWrapSampleAppender (0.00s)
scrape_test.go:259: Expected count of 1, got 0
scrape_test.go:290: Expected count of 1, got 0
2017/01/07 22:48:26 http: TLS handshake error from 127.0.0.1:50716: read tcp 127.0.0.1:40265->127.0.0.1:50716: read: connection reset by peer
FAIL
FAIL github.com/prometheus/prometheus/retrieval 3.603s
```
Should anybody ever find out why, please revert this commit accordingly.
This imposes a hard limit on the number of samples ingested from the
target. This is counted after metric relabelling, to allow dropping of
problemtic metrics.
This is intended as a very blunt tool to prevent overload due to
misbehaving targets that suddenly jump in sample count (e.g. adding
a label containing email addresses).
Add metric to track how often this happens.
Fixes#2137