The automatic GZIP handling of net/http does not preserve
buffers across requests and thus generates a lot of garbage.
We handle GZIP ourselves to circumvent this.t
This decreases checkpoint size by not checkpointing things
that don't actually need checkpointing.
This is fully compatible with the v2 checkpoint format,
as it makes series appear as though the only chunksdescs
in memory are those that need persisting.
With this change the scraping caches series references and only
allocates label sets if it has to retrieve a new reference.
pkg/textparse is used to do the conditional parsing and reduce
allocations from 900B/sample to 0 in the general case.
The current description does not accurately describe when the metric is incremented.
Aside from Alertmanger missing from the configuration, `prometheus_notifications_dropped_total` is incremented when errors occur while sending alert notifications to Alertmanager, or because the notifications queue is full, or because the number of notifications to be sent exceeds the queue capacity.
I think calling these cases 'errors' in a generic sense is more useful than the current description.
Add metrics around checkpointing and persistence
* Add a metric to say if checkpointing is happening,
and another to track total checkpoint time and count.
This breaks the existing prometheus_local_storage_checkpoint_duration_seconds
by renaming it to prometheus_local_storage_checkpoint_last_duration_seconds
as the former name is more appropriate for a summary.
* Add metric for last checkpoint size.
* Add metric for series/chunks processed by checkpoints.
For long checkpoints it'd be useful to see how they're progressing.
* Add metric for dirty series
* Add metric for number of chunks persisted per series.
You can get the number of chunks from chunk_ops,
but not the matching number of series. This helps determine
the size of the writes being made.
* Add metric for chunks queued for persistence
Chunks created includes both chunks that'll need persistence
and chunks read in for queries. This only includes chunks created
for persistence.
* Code review comments on new persistence metrics.
These lines exercise an append in
TestScrapeLoopWrapSampleAppender. Arguably, append shouldn't be tested
there in the first place.
Still it's weird why this fails on Travis:
```
--- FAIL: TestScrapeLoopWrapSampleAppender (0.00s)
scrape_test.go:259: Expected count of 1, got 0
scrape_test.go:290: Expected count of 1, got 0
2017/01/07 22:48:26 http: TLS handshake error from 127.0.0.1:50716: read tcp 127.0.0.1:40265->127.0.0.1:50716: read: connection reset by peer
FAIL
FAIL github.com/prometheus/prometheus/retrieval 3.603s
```
Should anybody ever find out why, please revert this commit accordingly.
* Add max concurrent and current queries engine metrics
This commit adds two metrics to the promql/engine: the
number of max concurrent queries, as configured by the flag, and
the number of current queries being served+blocked in the engine.
retreival.Target contains a mutex. It was copied in the Targets()
call. This potentially can wreak a lot of havoc.
It might even have caused the issues reported as #2266 and #2262 .