This adds negative buckets and access of float histograms to
TestHistogramChunkSameBuckets and TestHistogramChunkBucketChanges.
It also exercises a specific pattern of reusing an iterator (one where
no access has happened).
This exposes two bugs (where entries for positive buckets where used
where the corresponding entries for negative buckets should have been
used). One was fixed in #11627 (not merged), which triggered the work
in this commit.
This commit fixes both issues, so #11627 can be closed.
It also simplifies the code in the histogramIterator.Next method that
aims to recycle existing slice capacity.
Furthermore, this is on top of the release-2.40 branch because we
should probably cut a bugfix release for this.
Signed-off-by: beorn7 <beorn@grafana.com>
Typical parameters are one hour by 1 minute step, where the
function would allocate a slice of 3.6 million samples instead of 60.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* Switch from 'sanity' to more inclusive lanuage
"Removing ableist language in code is important; it helps to create and
maintain an environment that welcomes all developers of all backgrounds,
while emphasizing that we as developers select the most articulate,
precise, descriptive language we can rather than relying on metaphors.
The phrase sanity check is ableist, and unnecessarily references mental
health in our code bases. It denotes that people with mental illnesses
are inferior, wrong, or incorrect, and the phrase sanity continues to be
used by employers and other individuals to discriminate against these
people."
From https://gist.github.com/seanmhanson/fe370c2d8bd2b3228680e38899baf5cc
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Inverting the test for chunks deleted by tombstones makes all three
rejections consistent, and also avoids the case where a chunk is
excluded but still causes `trimFront` or `trimBack` to be set.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
This reduces garbage, hence goes faster, when a short time range is
required compared to the amount of chunks in the block. For example
recording rules and alerts often look only at the last few minutes.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* Add BenchmarkOpenBlock
* Use specific types when reading offset table
Instead of reading a generic-ish []string, we can read a generic type
which would be specifically labels.Label.
This avoid allocating a slice that escapes to the heap, making it both
faster and more efficient in terms of memory management.
* Update error message for unexpected number of keys
* s/posting offset table/postings offset table/
* Remove useless lastKey assignment
* Use two []bytes vars, simplify
Applied PR feedback: removed generics, moved the label indices reading
to that specific test as we're not using it in production anyway, we're
just testing what we've just built.
Also using two []bytes variables for name and value that use the backing
buffer instead of using strings, this reduces allocations a lot as we
only copy them when we store them (this is optimized by the compiler).
* Fix the dumb bug
Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
Co-authored-by: Marco Pracucci <marco@pracucci.com>
Function arguments in defer evaluated during definition of defer, not
during execution
Signed-off-by: Slavik Panasovets <slavik@google.com>
Signed-off-by: Slavik Panasovets <slavik@google.com>
With these changes, the "happy path" when the leading and trailing
number of bits don't need an update, fewer operations are needed.
The change is probably very marginal (no change in the benchmark added
here, but the benchmark also doesn't cover non-changing values), and
an argument could me made that avoiding pointers also has its
benefits.
However, I think that reducing the number of return values improves
readability. Which convinced me that I should at least propose this.
Signed-off-by: beorn7 <beorn@grafana.com>
The wlog.WL type can now be used to create a Write Ahead Log or a Write
Behind Log.
Before the prefix for wbl metrics was
'prometheus_tsdb_out_of_order_wal_' and has been replaced with
'prometheus_tsdb_out_of_order_wbl_'.
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>
Signed-off-by: Jesus Vazquez <jesusvazquez@users.noreply.github.com>
Co-authored-by: Ganesh Vernekar <15064823+codesome@users.noreply.github.com>
This call was added by PR #11075 merged before #11318 which changed all
similar calls to `sort.Sort` into a faster one.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Use new experimental package `golang.org/x/exp/slices`.
slices.Sort works on values that are directly comparable, like ints,
so avoids the overhad of an interface call to `.Less()`.
Left tests unchanged, because they don't need the speed and it may be
a cross-check that slices.Sort gives the same answer.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Use new experimental package `golang.org/x/exp/slices`.
Some of the speedup comes from comparing SeriesRef (which is an int64)
directly rather than through an interface `.Less()` call; some comes
from exp/slices using "pattern-defeating quicksort(pdqsort)".
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* tsdb/agent: fix application of defaults
MaxTS was being incorrectly constrained to the truncation interval
* add more tests to check validation
* force MaxWALTime = MinWALTime if min > max
Signed-off-by: Robert Fratto <robertfratto@gmail.com>
* TSDB chunks: remove race between writing and reading
Because the data is stored as a bit-stream, the last byte in the stream
could change if the stream is appended to after an Iterator is obtained.
Copy the last byte when the Iterator is created, so we don't have to
read it later.
Clarify in comments that concurrent Iterator and Appender are allowed,
but the chunk must not be modified while an Iterator is created.
(This was already the case, in order to copy the bstream slice header.)
* TSDB: stop saving last 4 samples in memSeries
This extra copy of the last 4 samples was introduced to avoid a race
condition between reading the last byte of the chunk and writing to it.
But now we have fixed that by having `bstreamReader` copy the last byte,
we don't need to copy the last 4 samples.
This change saves 56 bytes per series, which is very worthwhile when
you have millions or tens of millions of series.
* TSDB: tidy up stopIterator re-use
Previous changes have left this code duplicating some lines; pull
them out to a separate function and tidy up.
* TSDB head_test: stop checking when iterators are wrapped
The behaviour has changed so chunk iterators are only wrapped when
transaction isolation requires them to stop short of the end.
This makes tests fail which are checking the type.
Tests should check the observable behaviour, not the type.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
Co-authored-by: Ganesh Vernekar <ganeshvern@gmail.com>
* tsdb: add a basic test for read/write isolation
* tsdb: store the min time with isolationAppender
So that we can see when appending has moved past a certain point in time.
* tsdb: allow RangeHead to have isolation disabled
This will be used when for head compaction.
* tsdb: do head compaction with isolation disabled
This saves a lot of work tracking appends done while compaction is ongoing.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* tsdb: remove chunkRange from memSeries
chunkRange is the (oddly-named) configured duration for the head block.
We don't need a copy of this value per series. Pass it down where
required, and remove the copy.
The value in `Head` is only updated in `resetInMemoryState()`, which
also discards all `memSeries`.
* tsdb: remove oooCapMax from memSeries
oooCapMax is the configured maximum capacity for an out-of-order chunk.
Storing it per-series uses extra memory, and has surprising behaviour
if users change the value in config - series created before the change
will keep their old value.
Instead, pass it down where required, and remove the per-series value.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
Co-authored-by: Ganesh Vernekar <ganeshvern@gmail.com>
* Introduce out-of-order TSDB support
This implementation is based on this design doc:
https://docs.google.com/document/d/1Kppm7qL9C-BJB1j6yb6-9ObG3AbdZnFUBYPNNWwDBYM/edit?usp=sharing
This commit adds support to accept out-of-order ("OOO") sample into the TSDB
up to a configurable time allowance. If OOO is enabled, overlapping querying
are automatically enabled.
Most of the additions have been borrowed from
https://github.com/grafana/mimir-prometheus/
Here is the list ist of the original commits cherry picked
from mimir-prometheus into this branch:
- 4b2198d7ec
- 2836e5513f
- 00b379c3a5
- ff0dc75758
- a632c73352
- c6f3d4ab33
- 5e8406a1d4
- abde1e0ba1
- e70e769889
- df59320886
Co-authored-by: Jesus Vazquez <jesus.vazquez@grafana.com>
Co-authored-by: Ganesh Vernekar <ganeshvern@gmail.com>
Co-authored-by: Dieter Plaetinck <dieter@grafana.com>
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>
* gofumpt files
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>
* Add license header to missing files
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>
* Fix OOO tests due to existing chunk disk mapper implementation
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>
* Fix truncate int overflow
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>
* Add Sync method to the WAL and update tests
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>
* remove useless sync
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>
* Update minOOOTime after truncating Head
* Update minOOOTime after truncating Head
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* Fix lint
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* Add a unit test
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>
* Load OutOfOrderTimeWindow only once per appender
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>
* Fix OOO Head LabelValues and PostingsForMatchers
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>
* Fix replay of OOO mmap chunks
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* Remove unnecessary err check
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>
* Prevent panic with ApplyConfig
Signed-off-by: Ganesh Vernekar 15064823+codesome@users.noreply.github.com
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>
* Run OOO compaction after restart if there is OOO data from WBL
Signed-off-by: Ganesh Vernekar 15064823+codesome@users.noreply.github.com
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>
* Apply Bartek's suggestions
Co-authored-by: Bartlomiej Plotka <bwplotka@gmail.com>
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>
* Refactor OOO compaction
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* Address comments and TODOs
- Added a comment explaining why we need the allow overlapping
compaction toggle
- Clarified TSDBConfig OutOfOrderTimeWindow doc
- Added an owner to all the TODOs in the code
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>
* Run go format
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>
* Fix remaining review comments
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* Fix tests
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* Change wbl reference when truncating ooo in TestHeadMinOOOTimeUpdate
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>
* Fix TestWBLAndMmapReplay test failure on windows
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* Address most of the feedback
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* Refactor the block meta for out of order
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* Fix windows error
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* Fix review comments
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
Signed-off-by: Ganesh Vernekar 15064823+codesome@users.noreply.github.com
Co-authored-by: Ganesh Vernekar <15064823+codesome@users.noreply.github.com>
Co-authored-by: Ganesh Vernekar <ganeshvern@gmail.com>
Co-authored-by: Dieter Plaetinck <dieter@grafana.com>
Co-authored-by: Oleg Zaytsev <mail@olegzaytsev.com>
Co-authored-by: Bartlomiej Plotka <bwplotka@gmail.com>
If some shards did not get any samples mapped, the buffer will be empty
so sending it over the chan to `processWALSamples()` is a waste of time.
This is especially likely now we are checking `minValidTime` before
sending.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
The chunk pool belongs to the head not to the series. Pass it down where
required, and remove the copy of the pointer that `memSeries` was
holding.
`safeChunk` also needs to hold it, because in scenarios where it is used
we don't have a reference to the head. However it was already holding
`chunkDiskMapper` for the same reason, so no big change.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
There's no point splitting a sample onto the right shard and checking
if the series needs to be re-mapped, if we're only going to discard it
once it arrives at `ProcessWALSamples()`. Simply discard it earlier.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* Add runtime config to control native histogram ingestion
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* Make the config into a CLI flag
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
I find it useful to know why a restriction exists, to check whether that
reason still applies, or in which other places it might apply.
This is based on the note here:
https://github.com/prometheus/prometheus/pull/9270#pullrequestreview-743820956
on the PR where the original comment was added.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* Update go to 1.19, set min version to 1.18
Signed-off-by: Julien Pivotto <roidelapluie@o11y.eu>
* Update golangci-lint
Signed-off-by: Julien Pivotto <roidelapluie@o11y.eu>
Signed-off-by: Julien Pivotto <roidelapluie@o11y.eu>
Metadata was added recently but doesn't seem to be used much, at least as far as I could identify.
Yet it's part of memSeries struct and so even when empty takes 48 bytes,
which is a lot given that without it memSeries requires 224 bytes.
This change turns it into a pointer on the struct, that get set only when metadata is actually set of given series.
Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
* Add histogram validation
Signed-off-by: Levi Harrison <git@leviharrison.dev>
* Correct negative offset validation
Signed-off-by: Levi Harrison <git@leviharrison.dev>
* Address review comments
Signed-off-by: Levi Harrison <git@leviharrison.dev>
* Validation benchmark
Signed-off-by: Levi Harrison <git@leviharrison.dev>
* Add more checks
Signed-off-by: Levi Harrison <git@leviharrison.dev>
* Attempt to fix tests
Signed-off-by: Levi Harrison <git@leviharrison.dev>
* Fix stuff
Signed-off-by: Levi Harrison <git@leviharrison.dev>
* tsdb/record: Extract functions to encode and decode labels
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* tsdb: make use of Encode/Decode Labels
Simplify the code by re-using routines from tsdb/record.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* Append metadata to the WAL
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Remove extra whitespace; Reword some docstrings and comments
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Use RLock() for hasNewMetadata check
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Use single byte for metric type in RefMetadata
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Update proposed WAL format for single-byte type metadata
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Implementa MetadataAppender interface for the Agent
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Address first round of review comments
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Amend description of metadata in wal.md
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Correct key used to retrieve metadata from cache
When we're setting metadata entries in the scrapeCace, we're using the
p.Help(), p.Unit(), p.Type() helpers, which retrieve the series name and
use it as the cache key. When checking for cache entries though, we used
p.Series() as the key, which included the metric name _with_ its labels.
That meant that we were never actually hitting the cache. We're fixing
this by utiling the __name__ internal label for correctly getting the
cache entries after they've been set by setHelp(), setType() or
setUnit().
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Put feature behind a feature flag
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Fix AppendMetadata docstring
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Reorder WAL format document
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Change error message of AppendMetadata; Fix access of s.meta in AppendMetadata
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Reuse temporary buffer in Metadata encoder
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Only keep latest metadata for each refID during checkpointing
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Fix test that's referencing decoding metadata
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Avoid creating metadata block if no new metadata are present
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Add tests for corrupt metadata block and relevant record type
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Fix CR comments
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Extract logic about changing metadata in an anonymous function
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Implement new proposed WAL format and amend relevant tests
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Use 'const' for metadata field names
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Apply metadata to head memSeries in Commit, not in AppendMetadata
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Add docstring and rename extracted helper in scrape.go
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Add tests for tsdb-related cases
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Fix linter issues vol1
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Fix linter issues vol2
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Fix Windows test by closing WAL reader files
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Use switch instead of two if statements in metadata decoding
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Fix review comments around TestMetadata* tests
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Add code for replaying WAL; test correctness of in-memory data after a replay
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Remove scrape-loop related code from PR
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Address first round of comments
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Simplify tests by sorting slices before comparison
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Fix test to use separate transactions
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Empty out buffer and record slices after encoding latest metadata
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Fix linting issue
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Update calculation for DroppedMetadata metric
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Rename MetadataAppender interface and AppendMetadata method to MetadataUpdater/UpdateMetadata
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Reuse buffer when encoding latest metadata for each series
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Fix review comments; Check all returned error values using two helpers
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Simplify use of helpers
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Satisfy linter
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Add exemplar record case
Signed-off-by: Levi Harrison <git@leviharrison.dev>
* recordType() -> Type.String()
Signed-off-by: Levi Harrison <git@leviharrison.dev>
Previously, the maxTime wasn't updated properly in case of a recoding
happening.
My apologies for reformatting many lines for line length. During the
bug hunt, I tried to make things more readable in a reasonably wide
editor window.
Signed-off-by: beorn7 <beorn@grafana.com>
Previously, the maxTime wasn't updated properly in case of a recoding
happening.
My apologies for reformatting many lines for line length. During the
bug hunt, I tried to make things more readable in a reasonably wide
editor window.
Signed-off-by: beorn7 <beorn@grafana.com>
Relates to @bboreham optimization in https://github.com/prometheus/prometheus/pull/10859
Bryan did reduce the sleep time improving the deltas on the benchmark by
quite a lot. However I've been working on a similar implementation for
out of order and I noticed that we actually get into this method
thousands of times.
@ywwg had the brilliant idea of not always sleeping before the select
but actually make it a case in the select so we only sleep if we need
to.
The benchmark deltas are amazing
```
❯ benchstat old_implementation.txt new_implementation_using_time_after.txt
name old time/op new time/op delta
LoadWAL/batches=10,seriesPerBatch=100,samplesPerSeries=7200,exemplarsPerSeries=0,mmappedChunkT=0-8 521ms ±25% 253ms ± 6% -51.47% (p=0.008 n=5+5)
LoadWAL/batches=10,seriesPerBatch=100,samplesPerSeries=7200,exemplarsPerSeries=36,mmappedChunkT=0-8 773ms ± 3% 369ms ±31% -52.23% (p=0.008 n=5+5)
LoadWAL/batches=10,seriesPerBatch=100,samplesPerSeries=7200,exemplarsPerSeries=72,mmappedChunkT=0-8 592ms ±28% 297ms ±28% -49.80% (p=0.008 n=5+5)
LoadWAL/batches=10,seriesPerBatch=100,samplesPerSeries=7200,exemplarsPerSeries=360,mmappedChunkT=0-8 547ms ± 2% 999ms ±187% ~ (p=0.690 n=5+5)
LoadWAL/batches=10,seriesPerBatch=10000,samplesPerSeries=50,exemplarsPerSeries=0,mmappedChunkT=0-8 11.3s ± 4% 1.3s ±44% -88.48% (p=0.008 n=5+5)
LoadWAL/batches=10,seriesPerBatch=10000,samplesPerSeries=50,exemplarsPerSeries=2,mmappedChunkT=0-8 11.1s ± 1% 1.2s ±20% -89.08% (p=0.008 n=5+5)
LoadWAL/batches=10,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=0-8 1.24s ± 3% 0.18s ± 7% -85.76% (p=0.008 n=5+5)
LoadWAL/batches=10,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=2,mmappedChunkT=0-8 1.24s ± 2% 0.18s ± 5% -85.24% (p=0.008 n=5+5)
LoadWAL/batches=10,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=5,mmappedChunkT=0-8 1.23s ± 5% 0.27s ±33% -77.73% (p=0.008 n=5+5)
LoadWAL/batches=10,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=24,mmappedChunkT=0-8 1.28s ± 1% 0.36s ± 7% -71.51% (p=0.008 n=5+5)
LoadWAL/batches=100,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=3800-8 12.1s ± 1% 3.1s ± 6% -74.33% (p=0.008 n=5+5)
LoadWAL/batches=100,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=2,mmappedChunkT=3800-8 12.1s ± 1% 3.4s ± 4% -71.94% (p=0.008 n=5+5)
LoadWAL/batches=100,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=5,mmappedChunkT=3800-8 12.1s ± 1% 3.8s ±17% -68.35% (p=0.008 n=5+5)
LoadWAL/batches=100,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=24,mmappedChunkT=3800-8 12.4s ± 1% 4.0s ±18% -67.71% (p=0.008 n=5+5)
```
Benchmarked on Linux
```
goos: linux
goarch: amd64
pkg: github.com/prometheus/prometheus/tsdb
cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
```
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>
* Job queue
This PR reimplements chan chunkWriteJob with custom buffered queue that should use less memory, because it doesn't preallocate entire buffer for maximum queue size at once. Instead it allocates individual "segments" with smaller size.
As elements are added to the queue, they fill individual segments. When elements are removed from the queue (and segments), empty segments can be thrown away. This doesn't change memory usage of the queue when it's full, but should decrease its memory footprint when it's empty (queue will keep max 1 segment in such case).
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
* Modify test to work with low resolution timer.
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
* Improve comments.
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
* dont waste space on the chunkRefMap
* add time factor
* add comments
* better readability
* add instrumentation and more comments
* formatting
* uppercase comments
* Address review feedback. Renamed "free" to "shrink" everywhere, updated comments and threshold to 1000.
* double space
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
Co-authored-by: Peter Štibraný <pstibrany@gmail.com>
Co-authored-by: Mauro Stettler <mauro.stettler@gmail.com>
To avoid building up data in memory, commit and make a new appender
periodically.
The number `commitAfter = 10000` was chosen arbitrarily; testing with
10x more or less gives slightly worse results.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
When restarting Prometheus I sometimes see:
caller=db.go:832 level=error component=tsdb msg="compaction failed" err="compact head: persist head block: 2 errors: populate block: context canceled; context canceled"
And prometheus_tsdb_compactions_failed_total metric gets incremented.
This makes it more difficult to write alerts based on
prometheus_tsdb_compactions_failed_total metric since any restart can
trigger it.
Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
The code sleeps for a short time to allow goroutines to finish, however
it seems the duration can be reduced a lot, speeding up the reading
process.
I checked using some WAL data from production, and the queue is almost
always empty at the time we enter `waitForIdle()` so there is no danger
of spinning in the tight loop.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Instead of creating a new hashing object every time, call `crc32.Checksum`
which computes the answer without allocations.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* Rename walDir parameter to dir
Signed-off-by: Matej Gera <matejgera@gmail.com>
* Improve NewQueueManager comment
Signed-off-by: Matej Gera <matejgera@gmail.com>
This commit fixes a typo when reporting an error that the the symbols
table size has been exceeded.
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
On macOS, the TestTombstoneCleanRetentionLimitsRace performs very
poorly. It takes more than a second to write out one block, and as it
writes 400 of them, we run into the 10-minute test timeout frequently.
While this doesn't fix the actual performance issue, breaking each
iteration into a subtest makes the test pass reliably (because each
iteration comfortably finishes in under a minute).
Related report: https://groups.google.com/g/prometheus-developers/c/jxQ6Ayg6VJ4/m/03H_DS9PDAAJ
Signed-off-by: Matthias Rampke <matthias@prometheus.io>
"Labels is a sorted set of labels. Order has to be guaranteed upon
instantiation." says the comment, so fix all the tests that break this
rule.
For `BenchmarkLabelValuesWithMatchers()` and
`BenchmarkHeadLabelValuesWithMatchers()` the amount of work done changes
significantly if you put the labels in order, because all series refs
get neatly partitioned by the `tens` label, so I renamed the labels
to maintain the previous behaviour.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* refactor: move from io/ioutil to io and os packages
* use fs.DirEntry instead of os.FileInfo after os.ReadDir
Signed-off-by: MOREL Matthieu <matthieu.morel@cnp.fr>
* Add a test with variable samples rate append
This test overflows the chunk created in memseries, and the total amount
of samples in the (only) mmapped chunk is 29, instead of the 65565
appended ones.
Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
* Cut new chunk when rate prediction was wrong
When appending samples at a slow rate, and then appending at a higher
rate, the prediction we made to cut a new chunk is no longer valid.
Sometimes this can even cause an overflow in the chunk, if more samples
than uint16 can hold are appended.
Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
* Improve comment on 2*samplesPerChunk
Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
* Assert that all chunks have less than 240 samples
Also, trigger new chunk at 240, not at more than 240
Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
* tsdb/agent: Ignore duplicate exemplars
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Make each exemplar unique in TestCommit
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Re-Trigger CI for Windows and UI-related steps
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Change test comment to properly re-trigger pipeline
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* Defer Close() calls for test agent and segment reader
Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
* tsdb/agent: port grafana/agent#676grafana/agent#676 fixed an issue where a loading a WAL with multiple
segments may result in ref ID collision. The starting ref ID for new
series should be set to the highest ref ID across all series records
from all WAL segments.
This fixes an issue where the starting ref ID was incorrectly set to the
highest ref ID found in the newest segment, which may not have any ref
IDs at all if no series records have been appended to it yet.
Signed-off-by: Robert Fratto <robertfratto@gmail.com>
* tsdb/agent: update terminology (s/ref ID/nextRef)
Signed-off-by: Robert Fratto <robertfratto@gmail.com>
* tsdb: avoid slice-to-interface allocation in EnsureOrder
This is pulling the `seriesRefSlice` out of the loop, so the compiler
doesn't allocate a new one on the heap every time.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* tsdb: use pointer type in Pool for EnsureOrder
As noted by staticcheck, Pool prefers the objects in the pool to have
pointer type. This is a little more fiddly to code, but avoids
allocation of a wrapper object every time a slice is put into the pool.
Removed a comment that said fixing this has a performance penalty: not
borne out by benchmarks.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* Run gofumpt on all files
Getting golangci-lint errors when building on my laptop, possibly because I have newer version of gofumpt then what it was formatted with.
Run gofumpt -w -extra on all files as it will be needed in the future anyway.
* Update golangci-lint to v1.44.2
v1.44.0 upgraded gofumpt so bumping version in CI will help keep formatting correct for everyone
* Address golangci-lint error
Getting 'error-strings: error strings should not be capitalized or end with punctuation or a newline' from revive here.
Drop new line.
Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
* tsdb/agent: Fix deadlock from simultaneous GC and write
This commit fixes a potential deadlock where storing in-memory series
references could deadlock with a WAL GC cycle.
Signed-off-by: Robert Fratto <robertfratto@gmail.com>
* add missing license header
Signed-off-by: Robert Fratto <robertfratto@gmail.com>
* order local imports
Signed-off-by: Robert Fratto <robertfratto@gmail.com>
* align deadlock testing with discovery/manager_test.go method
Also prevents GCs from running concurrently, which could also cause a
deadlock (even though it's currently impossible for two GCs to run
concurrently).
Signed-off-by: Robert Fratto <robertfratto@gmail.com>
* Write chunks via queue, predicting the refs
Our load tests have shown that there is a latency spike in the
remote write handler whenever the head chunks need to be written,
because chunkDiskMapper.WriteChunk() blocks until the chunks are written
to disk.
This adds a queue to the chunk disk mapper which makes the WriteChunk()
method non-blocking unless the queue is full. Reads can still be served
from the queue.
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
* address PR feeddback
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
* initialize metrics without .Add(0)
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
* change isRunningMtx to normal lock
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
* do not re-initialize chunkrefmap
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
* update metric outside of lock scope
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
* add benchmark for adding job to chunk write queue
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
* remove unnecessary "success" var
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
* gofumpt -extra
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
* avoid WithLabelValues call in addJob
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
* format comments
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
* addressing PR feedback
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
* rename cutExpectRef to cutAndExpectRef
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
* use head.Init() instead of .initTime()
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
* address PR feedback
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
* PR feedback
Co-authored-by: Ganesh Vernekar <15064823+codesome@users.noreply.github.com>
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
* update test according to PR feedback
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
* replace callbackWg -> awaitCb
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
* better test of truncation with empty files
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
* replace callbackWg -> awaitCb
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
Co-authored-by: Ganesh Vernekar <15064823+codesome@users.noreply.github.com>
Unexported postingsWithIndexHeap's methods that don't need to be
exported, and added detailed comments.
Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
* tsdb: fix exemplar benchmarks
Go benchmarks are expected to do an amount of work that varies with
the `b.N` parameter. Previously these benchmarks would report a result
like 0.01 ns/op, which is nonsense.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* tsdb: use simpler map key to improve exemplar perf
Prometheus holds an index of exemplars so it can discard the oldest one
for a series when a new one is added.
Since the keys are not for human eyes, we can use a simpler format
and save the effort of quoting label values.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* Exemplars: allocate index map with estimated size
This avoids Go having to re-size the map several times as it grows.
16 exemplars per series is a guess; if it is too low then the map will
be sparse, while if it is too high then the map will have to resize once
or twice.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
See this comment for detailed explanation:
https://github.com/prometheus/prometheus/pull/9907#issuecomment-1002189932
TL;DR: if we don't call Pop() on the heap implementation, we don't need
to return our param as an `interface{}` so we save an allocation.
This would be popped for every label value, so it can be thousands of
saved allocations here (see benchmarks).
Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
* storage: expose bug in iterators #10027
Signed-off-by: beorn7 <beorn@grafana.com>
* storage: fix bug #10027 in iterators' Seek method
Signed-off-by: beorn7 <beorn@grafana.com>
* Append reporting metrics without limit
If reporting metrics fails due to reaching the limit, this makes the
target appear as UP in the UI, but the metrics are missing.
This commit bypasses that limit for report metrics.
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
* Remove check against cfg so interval/ timeout are always set (#10023) (#10031)
Signed-off-by: Nicholas Blott <blottn@tcd.ie>
Co-authored-by: Nicholas Blott <blottn@tcd.ie>
* Cut v2.32.1
Signed-off-by: Julius Volz <julius.volz@gmail.com>
* Apply suggestions from code review
Signed-off-by: Julius Volz <julius.volz@gmail.com>
Co-authored-by: Levi Harrison <git@leviharrison.dev>
Co-authored-by: Julien Pivotto <roidelapluie@inuits.eu>
Co-authored-by: Nicholas Blott <blottn@tcd.ie>
Co-authored-by: Julius Volz <julius.volz@gmail.com>
Co-authored-by: Levi Harrison <git@leviharrison.dev>
The atomic Int64 type wasn't able to be represented when logging
via go-kit log and ended up as `"unsupported value type"`.
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
* tsdb/agent: synchronize appender code with grafana/agent main
This commit synchronize the appender code with grafana/agent main. This
includes adding support for appending exemplars.
Closes#9610
Signed-off-by: Robert Fratto <robertfratto@gmail.com>
* tsdb/agent: fix build error
Signed-off-by: Robert Fratto <robertfratto@gmail.com>
* tsdb/agent: introduce some exemplar tests, refactor tests a little
Signed-off-by: Robert Fratto <robertfratto@gmail.com>
* tsdb/agent: address review feedback
- Re-use hash when creating a new series
- Fix typo in exemplar append error
Signed-off-by: Robert Fratto <robertfratto@gmail.com>
* tsdb/agent: remove unused AddFast method
Signed-off-by: Robert Fratto <robertfratto@gmail.com>
* tsdb/agent: close wal reader after test
Signed-off-by: Robert Fratto <robertfratto@gmail.com>
* tsdb/agent: add out-of-order tracking, change series TS in commit
Signed-off-by: Robert Fratto <robertfratto@gmail.com>
* address review feedback
Signed-off-by: Robert Fratto <robertfratto@gmail.com>
There was a subtle and nasty bug in listSeriesIterator.Seek.
In addition, the Seek call is defined to be a no-op if the current
position of the iterator is already pointing to a suitable
sample. This commit adds fast paths for this case to several
potentially expensive Seek calls.
Another bug was in concreteSeriesIterator.Seek. It always searched the
whole series and not from the current position of the iterator.
Signed-off-by: beorn7 <beorn@grafana.com>
- Pick At... method via return value of Next/Seek.
- Do not clobber returned buckets.
- Add partial FloatHistogram suppert.
Note that the promql package is now _only_ dealing with
FloatHistograms, following the idea that PromQL only knows float
values.
As a byproduct, I have removed the histogramSeries metric. In my
understanding, series can have both float and histogram samples, so
that metric doesn't make sense anymore.
As another byproduct, I have converged the sampleBuf and the
histogramSampleBuf in memSeries into one. The sample type stored in
the sampleBuf has been extended to also contain histograms even before
this commit.
Signed-off-by: beorn7 <beorn@grafana.com>
* Support appending different sample types to the same series
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* Fix comments
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* Fix build
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* Fix panic on WAL replay
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* Refactor: introduce walSubsetProcessor
walSubsetProcessor packages up the `processWALSamples()` function and
its input and output channels, helping to clarify how these things
relate.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* Refactor: extract more methods onto walSubsetProcessor
This makes the main logic easier to follow.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* Fix race warning by locking processWALSamples
Although we have waited for the processor to finish, we still get a
warning from the race detector because it doesn't know how the different
parts relate.
Add a lock round each batch of samples, so the race detector can see
that we never access series owned by the processor outside of a lock.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* Added test to reproduce issue 9859
Signed-off-by: Marco Pracucci <marco@pracucci.com>
* Remove redundant unit test
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* Fix out of order chunks during WAL replay
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* Fix nits
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
Co-authored-by: Ganesh Vernekar <ganeshvern@gmail.com>
Co-authored-by: Marco Pracucci <marco@pracucci.com>
Added validation to expected postings length compared to the bytes slice
length. With 32bit postings, we expect to have 4 bytes per each posting.
If the number doesn't add up, we know that the input data is not
compatible with our code (maybe it's cut, or padded with trash, or even
written in a different coded).
This is needed in downstream projects to correctly identify cached
postings written with an unknown codec, but it's also a good idea to
validate it here.
Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
* Disable isolation in isolation struct
Signed-off-by: darshanime <deathbullet@gmail.com>
* Run tsdb tests with isolation disabled
Signed-off-by: darshanime <deathbullet@gmail.com>
* Check for isolation disabled in isoState.Close()
Signed-off-by: darshanime <deathbullet@gmail.com>
* use t.Skip to skip isolation tests when disabled
Signed-off-by: darshanime <deathbullet@gmail.com>
* address review comments
Signed-off-by: darshanime <deathbullet@gmail.com>
* fix test for defaultIsolationState
Signed-off-by: darshanime <deathbullet@gmail.com>
* Change flag name. Set flag in DB. Do not init txRing. Close isoState.
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* Test disabled isolation in CircleCI test_go
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* Skip isolation related tests in db_test.go
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
Co-authored-by: Ganesh Vernekar <ganeshvern@gmail.com>