When out-of-order is enabled, queries go through both Head and OOOHead,
and they both execute the same PostingsForMatchers call, as memSeries
are shared for both.
In some cases these calls can be heavy, and also frequent. We can
deduplicate those calls by using the PostingsForMatchers cache that we
already use for query sharding.
The usage of this cache can skip a newly appended series in the results
for the duration of the ttl.
Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
The wlog.WL type can now be used to create a Write Ahead Log or a Write
Behind Log.
Before the prefix for wbl metrics was
'prometheus_tsdb_out_of_order_wal_' and has been replaced with
'prometheus_tsdb_out_of_order_wbl_'.
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>
Signed-off-by: Jesus Vazquez <jesusvazquez@users.noreply.github.com>
Co-authored-by: Ganesh Vernekar <15064823+codesome@users.noreply.github.com>
* TSDB chunks: remove race between writing and reading
Because the data is stored as a bit-stream, the last byte in the stream
could change if the stream is appended to after an Iterator is obtained.
Copy the last byte when the Iterator is created, so we don't have to
read it later.
Clarify in comments that concurrent Iterator and Appender are allowed,
but the chunk must not be modified while an Iterator is created.
(This was already the case, in order to copy the bstream slice header.)
* TSDB: stop saving last 4 samples in memSeries
This extra copy of the last 4 samples was introduced to avoid a race
condition between reading the last byte of the chunk and writing to it.
But now we have fixed that by having `bstreamReader` copy the last byte,
we don't need to copy the last 4 samples.
This change saves 56 bytes per series, which is very worthwhile when
you have millions or tens of millions of series.
* TSDB: tidy up stopIterator re-use
Previous changes have left this code duplicating some lines; pull
them out to a separate function and tidy up.
* TSDB head_test: stop checking when iterators are wrapped
The behaviour has changed so chunk iterators are only wrapped when
transaction isolation requires them to stop short of the end.
This makes tests fail which are checking the type.
Tests should check the observable behaviour, not the type.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
Co-authored-by: Ganesh Vernekar <ganeshvern@gmail.com>
* tsdb: add a basic test for read/write isolation
* tsdb: store the min time with isolationAppender
So that we can see when appending has moved past a certain point in time.
* tsdb: allow RangeHead to have isolation disabled
This will be used when for head compaction.
* tsdb: do head compaction with isolation disabled
This saves a lot of work tracking appends done while compaction is ongoing.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* tsdb: remove chunkRange from memSeries
chunkRange is the (oddly-named) configured duration for the head block.
We don't need a copy of this value per series. Pass it down where
required, and remove the copy.
The value in `Head` is only updated in `resetInMemoryState()`, which
also discards all `memSeries`.
* tsdb: remove oooCapMax from memSeries
oooCapMax is the configured maximum capacity for an out-of-order chunk.
Storing it per-series uses extra memory, and has surprising behaviour
if users change the value in config - series created before the change
will keep their old value.
Instead, pass it down where required, and remove the per-series value.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
Co-authored-by: Ganesh Vernekar <ganeshvern@gmail.com>
* Introduce out-of-order TSDB support
This implementation is based on this design doc:
https://docs.google.com/document/d/1Kppm7qL9C-BJB1j6yb6-9ObG3AbdZnFUBYPNNWwDBYM/edit?usp=sharing
This commit adds support to accept out-of-order ("OOO") sample into the TSDB
up to a configurable time allowance. If OOO is enabled, overlapping querying
are automatically enabled.
Most of the additions have been borrowed from
https://github.com/grafana/mimir-prometheus/
Here is the list ist of the original commits cherry picked
from mimir-prometheus into this branch:
- 4b2198d7ec
- 2836e5513f
- 00b379c3a5
- ff0dc75758
- a632c73352
- c6f3d4ab33
- 5e8406a1d4
- abde1e0ba1
- e70e769889
- df59320886
Co-authored-by: Jesus Vazquez <jesus.vazquez@grafana.com>
Co-authored-by: Ganesh Vernekar <ganeshvern@gmail.com>
Co-authored-by: Dieter Plaetinck <dieter@grafana.com>
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>
* gofumpt files
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>
* Add license header to missing files
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>
* Fix OOO tests due to existing chunk disk mapper implementation
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>
* Fix truncate int overflow
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>
* Add Sync method to the WAL and update tests
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>
* remove useless sync
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>
* Update minOOOTime after truncating Head
* Update minOOOTime after truncating Head
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* Fix lint
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* Add a unit test
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>
* Load OutOfOrderTimeWindow only once per appender
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>
* Fix OOO Head LabelValues and PostingsForMatchers
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>
* Fix replay of OOO mmap chunks
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* Remove unnecessary err check
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>
* Prevent panic with ApplyConfig
Signed-off-by: Ganesh Vernekar 15064823+codesome@users.noreply.github.com
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>
* Run OOO compaction after restart if there is OOO data from WBL
Signed-off-by: Ganesh Vernekar 15064823+codesome@users.noreply.github.com
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>
* Apply Bartek's suggestions
Co-authored-by: Bartlomiej Plotka <bwplotka@gmail.com>
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>
* Refactor OOO compaction
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* Address comments and TODOs
- Added a comment explaining why we need the allow overlapping
compaction toggle
- Clarified TSDBConfig OutOfOrderTimeWindow doc
- Added an owner to all the TODOs in the code
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>
* Run go format
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>
* Fix remaining review comments
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* Fix tests
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* Change wbl reference when truncating ooo in TestHeadMinOOOTimeUpdate
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>
* Fix TestWBLAndMmapReplay test failure on windows
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* Address most of the feedback
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* Refactor the block meta for out of order
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* Fix windows error
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* Fix review comments
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
Signed-off-by: Ganesh Vernekar 15064823+codesome@users.noreply.github.com
Co-authored-by: Ganesh Vernekar <15064823+codesome@users.noreply.github.com>
Co-authored-by: Ganesh Vernekar <ganeshvern@gmail.com>
Co-authored-by: Dieter Plaetinck <dieter@grafana.com>
Co-authored-by: Oleg Zaytsev <mail@olegzaytsev.com>
Co-authored-by: Bartlomiej Plotka <bwplotka@gmail.com>
The chunk pool belongs to the head not to the series. Pass it down where
required, and remove the copy of the pointer that `memSeries` was
holding.
`safeChunk` also needs to hold it, because in scenarios where it is used
we don't have a reference to the head. However it was already holding
`chunkDiskMapper` for the same reason, so no big change.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* Add runtime config to control native histogram ingestion
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* Make the config into a CLI flag
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
Metadata was added recently but doesn't seem to be used much, at least as far as I could identify.
Yet it's part of memSeries struct and so even when empty takes 48 bytes,
which is a lot given that without it memSeries requires 224 bytes.
This change turns it into a pointer on the struct, that get set only when metadata is actually set of given series.
Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
* Append metadata to the WAL
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Remove extra whitespace; Reword some docstrings and comments
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Use RLock() for hasNewMetadata check
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Use single byte for metric type in RefMetadata
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Update proposed WAL format for single-byte type metadata
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Implementa MetadataAppender interface for the Agent
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Address first round of review comments
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Amend description of metadata in wal.md
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Correct key used to retrieve metadata from cache
When we're setting metadata entries in the scrapeCace, we're using the
p.Help(), p.Unit(), p.Type() helpers, which retrieve the series name and
use it as the cache key. When checking for cache entries though, we used
p.Series() as the key, which included the metric name _with_ its labels.
That meant that we were never actually hitting the cache. We're fixing
this by utiling the __name__ internal label for correctly getting the
cache entries after they've been set by setHelp(), setType() or
setUnit().
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Put feature behind a feature flag
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Fix AppendMetadata docstring
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Reorder WAL format document
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Change error message of AppendMetadata; Fix access of s.meta in AppendMetadata
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Reuse temporary buffer in Metadata encoder
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Only keep latest metadata for each refID during checkpointing
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Fix test that's referencing decoding metadata
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Avoid creating metadata block if no new metadata are present
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Add tests for corrupt metadata block and relevant record type
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Fix CR comments
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Extract logic about changing metadata in an anonymous function
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Implement new proposed WAL format and amend relevant tests
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Use 'const' for metadata field names
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Apply metadata to head memSeries in Commit, not in AppendMetadata
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Add docstring and rename extracted helper in scrape.go
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Add tests for tsdb-related cases
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Fix linter issues vol1
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Fix linter issues vol2
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Fix Windows test by closing WAL reader files
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Use switch instead of two if statements in metadata decoding
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Fix review comments around TestMetadata* tests
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Add code for replaying WAL; test correctness of in-memory data after a replay
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Remove scrape-loop related code from PR
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Address first round of comments
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Simplify tests by sorting slices before comparison
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Fix test to use separate transactions
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Empty out buffer and record slices after encoding latest metadata
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Fix linting issue
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Update calculation for DroppedMetadata metric
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Rename MetadataAppender interface and AppendMetadata method to MetadataUpdater/UpdateMetadata
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Reuse buffer when encoding latest metadata for each series
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Fix review comments; Check all returned error values using two helpers
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Simplify use of helpers
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Satisfy linter
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
This implementation is based on this design doc:
https://docs.google.com/document/d/1Kppm7qL9C-BJB1j6yb6-9ObG3AbdZnFUBYPNNWwDBYM/edit?usp=sharing
This commit adds support to accept out-of-order ("OOO") sample into the TSDB
up to a configurable time allowance. If OOO is enabled, overlapping querying
are automatically enabled.
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
Co-authored-by: Jesus Vazquez <jesus.vazquez@grafana.com>
Co-authored-by: Ganesh Vernekar <ganeshvern@gmail.com>
Co-authored-by: Dieter Plaetinck <dieter@grafana.com>
* Chunks replay skips chunks with unknown encodings
We've changed the logic of loadMmappedChunks to skip chunks that have
unknown encodings. To do so we've modified IterateAllChunks to accept an
extra encoding argument in the callback function.
Also added unit tests in the head and chunk disk mapper.
* Also add an unit test for the old chunk diskmapper
* s/createUnsupportedChunk/writeUnsupportedChunk/g
* Write chunks via queue, predicting the refs
Our load tests have shown that there is a latency spike in the
remote write handler whenever the head chunks need to be written,
because chunkDiskMapper.WriteChunk() blocks until the chunks are written
to disk.
This adds a queue to the chunk disk mapper which makes the WriteChunk()
method non-blocking unless the queue is full. Reads can still be served
from the queue.
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
* address PR feeddback
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
* initialize metrics without .Add(0)
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
* change isRunningMtx to normal lock
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
* do not re-initialize chunkrefmap
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
* update metric outside of lock scope
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
* add benchmark for adding job to chunk write queue
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
* remove unnecessary "success" var
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
* gofumpt -extra
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
* avoid WithLabelValues call in addJob
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
* format comments
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
* addressing PR feedback
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
* rename cutExpectRef to cutAndExpectRef
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
* use head.Init() instead of .initTime()
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
* address PR feedback
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
* PR feedback
Co-authored-by: Ganesh Vernekar <15064823+codesome@users.noreply.github.com>
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
* update test according to PR feedback
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
* replace callbackWg -> awaitCb
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
* better test of truncation with empty files
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
* replace callbackWg -> awaitCb
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
Co-authored-by: Ganesh Vernekar <15064823+codesome@users.noreply.github.com>
* write chunks via queue, predicting chunkrefs
* fixes
* linter
* cleanup
* better test coverage
* deduplicate operations when cutting file
* better comments
* fix race
* improvements based on PR feedback
* comments
* concurrent correctness test of write queue
* linter
* use isStarted flag in chunk write queue
* separate mutex for isStarted
* fixing tests
* fix race in test
* PR feedback
* add license headers
* PR feedback
* pr feedback
* dont recreate workerCtrl channel
* address PR feedback in high concurrency read and write test
* refactor the high concurrency write queue test
* pr feedback
* use require.Eventually
* typo
* Fixing comment
Co-authored-by: Peter Štibraný <peter.stibrany@grafana.com>
* use buffered chan for chunk write queue
* fix races
* address feedback on PR
* better comment
* less type casts
* dont reeimplement varint
* move mtx above property it protects
* improve tests by using wg instead of inspecting queue
* add comment
* add comment
* writeChunkF comment
* rename isStarted -> isRunning
* prevent panic when double stopping
* fix variable name in comment
* delete outdated comment
* naming and comment fixes in TestChunkWriteQueue_WrappingAroundSizeLimit
* use require.False instead of require.Equal(t, false
* always enable chunk write queue
* dont call callback with lock held
* undo rename of curFileSequence to curFileSeq
* fix accidental change
* fix comment
* Better syntax
Co-authored-by: Peter Štibraný <peter.stibrany@grafana.com>
* Better wording in comment
Co-authored-by: Peter Štibraný <peter.stibrany@grafana.com>
* better comments to explain the use of coordinator chans in test
* fix test which broke to the async writing of chunks
* undo previous renaming of shadowing variable
* rename var threadID to workerID
* rename variables based on PR feedback
* rename pos -> ts
* better implementation of whileNotCanceled
* update querySeriesRef before using it
* rename labels -> lbls
* initialize the chunk write queue operations metric with all possible labels
* dont return error from CutNewFile
* recreate chunk ref map regularly to free
* limit how often we recreate the chunk ref map
* fix typo in comment
* better comment
Co-authored-by: Peter Štibraný <peter.stibrany@grafana.com>
The atomic Int64 type wasn't able to be represented when logging
via go-kit log and ended up as `"unsupported value type"`.
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
* track OOO-ness in tsdb into a histogram
This should help with building an understanding of customer requirements
as far as how far back samples tend to go.
Note: this has already been discussed, reviewed and approved on:
https://github.com/grafana/mimir/pull/488
Signed-off-by: Dieter Plaetinck <dieter@grafana.com>
* update the test files which were not in mimir/vendor
- Pick At... method via return value of Next/Seek.
- Do not clobber returned buckets.
- Add partial FloatHistogram suppert.
Note that the promql package is now _only_ dealing with
FloatHistograms, following the idea that PromQL only knows float
values.
As a byproduct, I have removed the histogramSeries metric. In my
understanding, series can have both float and histogram samples, so
that metric doesn't make sense anymore.
As another byproduct, I have converged the sampleBuf and the
histogramSampleBuf in memSeries into one. The sample type stored in
the sampleBuf has been extended to also contain histograms even before
this commit.
Signed-off-by: beorn7 <beorn@grafana.com>