* storage: Added Chunks{Queryable/Querier/SeriesSet/Series/Iteratable. Added generic Merge{SeriesSet/Querier} implementation.
## Rationales:
In many places (e.g. chunk Remote read, Thanos Receive fetching chunk from TSDB), we operate on encoded chunks not samples.
This means that we unnecessary decode/encode, wasting CPU, time and memory.
This PR adds chunk iterator interfaces and makes the merge code to be reused between both seriesSets
I will make the use of it in following PR inside tsdb itself. For now fanout implements it and mergers.
All merges now also allows passing series mergers. This opens doors for custom deduplications other than TSDB vertical ones (e.g. offline one we have in Thanos).
## Changes
* Added Chunk versions of all iterating methods. It all starts in Querier/ChunkQuerier. The plan is that
Storage will implement both chunked and samples.
* Added Seek to chunks.Iterator interface for iterating over chunks.
* NewMergeChunkQuerier was added; Both this and NewMergeQuerier are now using generigMergeQuerier to share the code. Generic code was added.
* Improved tests.
* Added some TODO for further simplifications in next PRs.
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
* Addressed Brian's comments.
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
* Moved s/Labeled/SeriesLabels as per Krasi suggestion.
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
* Addressed Krasi's comments.
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
* Second iteration of Krasi comments.
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
* Another round of comments.
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
This is part of https://github.com/prometheus/prometheus/pull/5882 that can be done to simplify things.
All todos I added will be fixed in follow up PRs.
* querier.Querier, querier.Appender, querier.SeriesSet, and querier.Series interfaces merged
with storage interface.go. All imports that.
* querier.SeriesIterator replaced by chunkenc.Iterator
* Added chunkenc.Iterator.Seek method and tests for xor implementation (?)
* Since we properly handle SelectParams for Select methods I adjusted min max
based on that. This should help in terms of performance for queries with functions like offset.
* added Seek to deletedIterator and test.
* storage/tsdb was removed as it was only a unnecessary glue with incompatible structs.
No logic was changed, only different source of abstractions, so no need for benchmarks.
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
* Track remote write queues via a map so we don't care about index.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Support a job name for remote write/read so we can differentiate between
them using the name.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Remote write/read has Name to not confuse the meaning of the field with
scrape job names.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Split queue/client label into remote_name and url labels.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Don't allow for duplicate remote write/read configs.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Ensure we restart remote write queues if the hash of their config has
not changed, but the remote name has changed.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Include name in remote read/write config hashes, simplify duplicates
check, update test accordingly.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
This allows other processes to reuse just the remote write code without
having to use the remote read code as well. This will be used to create
a sidecar capable of sending remote write payloads.
Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
* Update remote write and remote read separately
* Add external labels to the remote write conf hash
* Add unit tests for remote storage lifecycle
Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
* Consistently pre-lookup the metrics for a given queue in queue manager.
* Don't open the WAL (for writing) in the remote_write code.
* Add some more logging.
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
- Remove prometheus_remote_queue_last_send_timestamp_seconds metric. Its not particularly useful, we have highest_timestamp_seconds.
- Factor out maxGauage, a gauge that only increases.
- Change sharding calculations to use max samples in timestamp - max samples out timestamp (not rates).
- Also include the ratio of samples dropped to correctly predict number of pending samples.
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
- Add a dropped samples EWMA and use it in calculating desired shards.
- Update metric names and a log messages.
- Limit number of entries in the dedupe logging middleware to prevent potential OOM.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
- If we're replaying the WAL to get series records, skip that segment when we hit corruptions.
- If we're tailing the WAL for samples, fail the watcher.
- When the watcher fails, restart from the latest checkpoint - and only send new samples by updating startTime.
- Tidy up log lines and error handling, don't return so many errors on quiting.
- Expect EOF when processing checkpoints.
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
- Remove datarace in the exported highest scrape timestamp.
- Backoff on enqueue should be per-sample - reset the result for each sample.
- Remove diffKeys, unused ctx and cancelfunc in WALWatcher, 'name' from writeTo interface, and pass it to constructor.
- Reorder functions in WALWatcher depth-first according to call graph.
- Fix vendor/modules.txt.
- Split out the various timer periods into consts at the top of the file.
- Move w.currentSegmentMetric.Set close to where we set the currentSegment.
- Combine r.Next() and isClosed(w.quit) into a single loop.
- Unnest some ifs in WALWatcher.watch, propagate erros in decodeRecord, add some new lines to make it easier to read.
- Reorganise checkpoint handling to reduce nesting and make it easier to follow.
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
This change switches the remote_write API to use the TSDB WAL. This should reduce memory usage and prevent sample loss when the remote end point is down.
We use the new LiveReader from TSDB to tail WAL segments. Logic for finding the tracking segment is included in this PR. The WAL is tailed once for each remote_write endpoint specified. Reading from the segment is based on a ticker rather than relying on fsnotify write events, which were found to be complicated and unreliable in early prototypes.
Enqueuing a sample for sending via remote_write can now block, to provide back pressure. Queues are still required to acheive parallelism and batching. We have updated the queue config based on new defaults for queue capacity and pending samples values - much smaller values are now possible. The remote_write resharding code has been updated to prevent deadlocks, and extra tests have been added for these cases.
As part of this change, we attempt to guarantee that samples are not lost; however this initial version doesn't guarantee this across Prometheus restarts or non-retryable errors from the remote end (eg 400s).
This changes also includes the following optimisations:
- only marshal the proto request once, not once per retry
- maintain a single copy of the labels for given series to reduce GC pressure
Other minor tweaks:
- only reshard if we've also successfully sent recently
- add pending samples, latest sent timestamp, WAL events processed metrics
Co-authored-by: Chris Marchbanks <csmarchbanks.com> (initial prototype)
Co-authored-by: Tom Wilkie <tom.wilkie@gmail.com> (sharding changes)
Signed-off-by: Callum Styan <callumstyan@gmail.com>
For special remote read endpoints which have only data for specific
queries, it is desired to limit the number of queries sent to the
configured remote read endpoint to reduce latency and performance
overhead.
* Decouple remote client from ReadRecent feature.
* Separate remote read filter into a small, testable function.
* Use storage.Queryable interface to compose independent
functionalities.
Currently all read queries are simply pushed to remote read clients.
This is fine, except for remote storage for wich it unefficient and
make query slower even if remote read is unnecessary.
So we need instead to compare the oldest timestamp in primary/local
storage with the query range lower boundary. If the oldest timestamp
is older than the mint parameter, then there is no need for remote read.
This is an optionnal behavior per remote read client.
Signed-off-by: Thibault Chataigner <t.chataigner@criteo.com>