This decreases checkpoint size by skipping chunkDescs that do not need checkpointing.
It is fully compatible with the v2 checkpoint format,
as it makes series appear as though the only chunkDescs
in memory are those that need persisting.
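A minimal sketch of the idea in Go, with hypothetical, simplified types (memorySeries, chunkDescs, and persistWatermark are stand-ins for the real local storage state):

```go
package local

// Hypothetical, simplified types; the real memorySeries and chunkDesc
// carry much more state.
type chunkDesc struct {
	// chunk data and metadata omitted
}

type memorySeries struct {
	chunkDescs       []*chunkDesc
	persistWatermark int // number of chunkDescs already persisted to disk
}

// chunkDescsToCheckpoint returns only the chunkDescs that still need
// persisting, so the checkpoint sees the series as if those were the
// only chunkDescs held in memory.
func (s *memorySeries) chunkDescsToCheckpoint() []*chunkDesc {
	return s.chunkDescs[s.persistWatermark:]
}
```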
With this change the scrape loop caches series references and only
allocates label sets when it has to retrieve a new reference.
pkg/textparse is used to do the conditional parsing, reducing
allocations from 900B/sample to 0 in the general case.
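A rough sketch of the caching idea, not the actual scrape loop code; scrapeCache, seriesRef, and the method names are illustrative. The raw, unparsed metric line serves as the cache key, so label sets only have to be allocated on a cache miss:

```go
package retrieval

// seriesRef is the reference number handed back by the storage appender.
type seriesRef uint64

// scrapeCache remembers the reference for each raw, unparsed metric line,
// so that on subsequent scrapes the sample can be appended via the
// reference without allocating a label set at all.
type scrapeCache struct {
	refs map[string]seriesRef
}

func newScrapeCache() *scrapeCache {
	return &scrapeCache{refs: map[string]seriesRef{}}
}

// get returns the cached reference for a raw metric line, if any.
// A map[string] lookup on a converted []byte key does not allocate.
func (c *scrapeCache) get(rawLine []byte) (seriesRef, bool) {
	ref, ok := c.refs[string(rawLine)]
	return ref, ok
}

// set records the reference obtained after a new series had to be parsed
// and added with its full label set.
func (c *scrapeCache) set(rawLine []byte, ref seriesRef) {
	c.refs[string(rawLine)] = ref
}
```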
The current description does not accurately describe when the metric is incremented.
Aside from the case of Alertmanager missing from the configuration, `prometheus_notifications_dropped_total` is incremented when errors occur while sending alert notifications to Alertmanager, when the notification queue is full, or when the number of notifications to be sent exceeds the queue capacity.
I think calling these cases 'errors' in a generic sense is more useful than the current description.
The former approach created unordered postings lists, either because
map iteration over new series is unsorted (fixable) or because concurrent
writers create new series interleaved.
We switch back to generating ephemeral references for a single batch.
Newly created series have to be re-set upon the next insert.
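A minimal sketch of the batch-local approach, with hypothetical names; creating the new series in sorted label order at commit time is shown as one way to keep postings ordered, for illustration rather than as the actual implementation:

```go
package head

import "sort"

// pendingSeries is a series created within the current batch, addressed by
// an ephemeral, batch-local reference (its index in the pending slice).
type pendingSeries struct {
	labels []string // flattened label pairs: "name", "value", ...
}

// batch collects new series under ephemeral references.
type batch struct {
	pending []pendingSeries
}

// add registers a new series and returns its ephemeral reference, which is
// only valid until the batch is committed.
func (b *batch) add(lbls []string) uint64 {
	b.pending = append(b.pending, pendingSeries{labels: lbls})
	return uint64(len(b.pending) - 1)
}

// commit creates the real series in sorted label order, which keeps the
// postings lists ordered; afterwards all ephemeral references are invalid
// and callers must re-add their series on the next insert.
func (b *batch) commit(create func(lbls []string)) {
	sort.Slice(b.pending, func(i, j int) bool {
		return less(b.pending[i].labels, b.pending[j].labels)
	})
	for _, s := range b.pending {
		create(s.labels)
	}
	b.pending = b.pending[:0]
}

// less compares two flattened label sets lexicographically.
func less(a, b []string) bool {
	for i := 0; i < len(a) && i < len(b); i++ {
		if a[i] != b[i] {
			return a[i] < b[i]
		}
	}
	return len(a) < len(b)
}
```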
This exposes the reference number of a series represented by a label set
to clients.
Subsequent samples can be added directly via the reference rather than
by repeatedly passing in the full labels. This drastically speeds up the
append process.
The appender chain uses different sections of the reference number for
assignment to child appenders and for invalidating reference numbers as
necessary.
Clients can either pass reference numbers around themselves or maintain their
own optimized lookup, e.g. by directly associating unparsed metric
descriptor strings with reference numbers.
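A hedged sketch of how an appender chain might carve up the reference number; the Appender interface, chainAppender, and the top-byte encoding below are illustrative rather than the actual tsdb types:

```go
package tsdb

// Appender is a reduced sketch of an appender that hands out references.
type Appender interface {
	// Add creates the series for the label set if needed and returns a
	// reference number for it.
	Add(lbls []string, t int64, v float64) (uint64, error)
	// AddFast appends a sample to an already referenced series.
	AddFast(ref uint64, t int64, v float64) error
}

// childShift reserves the top byte of a reference for the child index.
const childShift = 56

// chainAppender fans out to child appenders. Encoding the child index in
// the upper bits lets it route AddFast calls back to the right child and
// invalidate a whole child's references at once.
type chainAppender struct {
	children []Appender
}

func (c *chainAppender) Add(lbls []string, t int64, v float64) (uint64, error) {
	i := route(lbls, len(c.children)) // hypothetical routing function
	ref, err := c.children[i].Add(lbls, t, v)
	if err != nil {
		return 0, err
	}
	return uint64(i)<<childShift | ref, nil
}

func (c *chainAppender) AddFast(ref uint64, t int64, v float64) error {
	i := int(ref >> childShift)
	return c.children[i].AddFast(ref&(1<<childShift-1), t, v)
}

// route picks a child for a label set; a real implementation would hash.
func route(lbls []string, n int) int { return 0 }
```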
Add metrics around checkpointing and persistence
* Add a metric to say if checkpointing is happening,
and another to track total checkpoint time and count
(a registration sketch follows this list).
This breaks the existing prometheus_local_storage_checkpoint_duration_seconds
by renaming it to prometheus_local_storage_checkpoint_last_duration_seconds,
as the former name is more appropriate for a summary.
* Add metric for last checkpoint size.
* Add metric for series/chunks processed by checkpoints.
For long checkpoints it'd be useful to see how they're progressing.
* Add metric for dirty series
* Add metric for number of chunks persisted per series.
You can get the number of chunks from chunk_ops,
but not the matching number of series. This helps determine
the size of the writes being made.
* Add metric for chunks queued for persistence.
The existing chunks-created metric includes both chunks that'll need
persistence and chunks read in for queries; this new metric only counts
chunks created for persistence.
* Code review comments on new persistence metrics.
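A hedged sketch, using the client_golang API, of how the first few of these metrics could be defined; the metric names and help strings below (other than the renamed last-duration metric mentioned above) are illustrative:

```go
package local

import "github.com/prometheus/client_golang/prometheus"

var (
	// Set to 1 while a checkpoint is running, 0 otherwise.
	checkpointing = prometheus.NewGauge(prometheus.GaugeOpts{
		Namespace: "prometheus",
		Subsystem: "local_storage",
		Name:      "checkpointing",
		Help:      "1 if the storage is currently checkpointing, 0 otherwise.",
	})
	// Tracks total checkpoint time and count as a summary.
	checkpointDuration = prometheus.NewSummary(prometheus.SummaryOpts{
		Namespace: "prometheus",
		Subsystem: "local_storage",
		Name:      "checkpoint_duration_seconds",
		Help:      "Duration of checkpointing open chunks and chunks yet to be persisted.",
	})
	// Duration of the most recent checkpoint only.
	checkpointLastDuration = prometheus.NewGauge(prometheus.GaugeOpts{
		Namespace: "prometheus",
		Subsystem: "local_storage",
		Name:      "checkpoint_last_duration_seconds",
		Help:      "Duration of the last checkpoint of open chunks and chunks yet to be persisted.",
	})
)

func init() {
	prometheus.MustRegister(checkpointing, checkpointDuration, checkpointLastDuration)
}
```

While a checkpoint runs, the gauge would be set to 1; once it finishes, the elapsed seconds would be observed on the summary and set on the last-duration gauge.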
This adds a memory series holding several chunks, replacing the single
head chunk per series used so far.
This is necessary for uniform maximum chunk sizes in cases
where some series have higher-frequency samples than others.
This adds a 4-sample buffer to every head chunk. The XOR
compression scheme may edit bytes in place, and the minimum size
of a sample is 2 bits, so keeping the last 4 samples in an in-memory
buffer makes it safe to query the preceding ones while samples
are added.
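A rough sketch with hypothetical types: the most recent samples stay in a plain buffer, so readers never have to decode the trailing bytes that the XOR encoder may still rewrite in place:

```go
package chunks

// sample is a timestamp/value pair.
type sample struct {
	t int64
	v float64
}

// headChunk keeps the 4 most recently appended samples in a plain buffer.
// The XOR encoding may rewrite the trailing bytes of data when the next
// sample arrives, so readers take the buffered samples from buf instead of
// decoding that unstable tail.
type headChunk struct {
	data []byte    // XOR-compressed samples; tail may change on append
	buf  [4]sample // last 4 samples, safe to read while appends happen
	n    int       // total number of samples in the chunk
}

// append adds a sample, shifting the buffer of recent samples.
func (c *headChunk) append(t int64, v float64) {
	copy(c.buf[:], c.buf[1:])
	c.buf[3] = sample{t, v}
	// ... encode the sample into c.data (omitted) ...
	c.n++
}
```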
These lines exercise an append in
TestScrapeLoopWrapSampleAppender. Arguably, append shouldn't be tested
there in the first place.
Still, it's unclear why this fails on Travis:
```
--- FAIL: TestScrapeLoopWrapSampleAppender (0.00s)
scrape_test.go:259: Expected count of 1, got 0
scrape_test.go:290: Expected count of 1, got 0
2017/01/07 22:48:26 http: TLS handshake error from 127.0.0.1:50716: read tcp 127.0.0.1:40265->127.0.0.1:50716: read: connection reset by peer
FAIL
FAIL github.com/prometheus/prometheus/retrieval 3.603s
```
Should anybody ever find out why, please revert this commit accordingly.