Relates to @bboreham optimization in https://github.com/prometheus/prometheus/pull/10859
Bryan did reduce the sleep time improving the deltas on the benchmark by
quite a lot. However I've been working on a similar implementation for
out of order and I noticed that we actually get into this method
thousands of times.
@ywwg had the brilliant idea of not always sleeping before the select
but actually make it a case in the select so we only sleep if we need
to.
The benchmark deltas are amazing
```
❯ benchstat old_implementation.txt new_implementation_using_time_after.txt
name old time/op new time/op delta
LoadWAL/batches=10,seriesPerBatch=100,samplesPerSeries=7200,exemplarsPerSeries=0,mmappedChunkT=0-8 521ms ±25% 253ms ± 6% -51.47% (p=0.008 n=5+5)
LoadWAL/batches=10,seriesPerBatch=100,samplesPerSeries=7200,exemplarsPerSeries=36,mmappedChunkT=0-8 773ms ± 3% 369ms ±31% -52.23% (p=0.008 n=5+5)
LoadWAL/batches=10,seriesPerBatch=100,samplesPerSeries=7200,exemplarsPerSeries=72,mmappedChunkT=0-8 592ms ±28% 297ms ±28% -49.80% (p=0.008 n=5+5)
LoadWAL/batches=10,seriesPerBatch=100,samplesPerSeries=7200,exemplarsPerSeries=360,mmappedChunkT=0-8 547ms ± 2% 999ms ±187% ~ (p=0.690 n=5+5)
LoadWAL/batches=10,seriesPerBatch=10000,samplesPerSeries=50,exemplarsPerSeries=0,mmappedChunkT=0-8 11.3s ± 4% 1.3s ±44% -88.48% (p=0.008 n=5+5)
LoadWAL/batches=10,seriesPerBatch=10000,samplesPerSeries=50,exemplarsPerSeries=2,mmappedChunkT=0-8 11.1s ± 1% 1.2s ±20% -89.08% (p=0.008 n=5+5)
LoadWAL/batches=10,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=0-8 1.24s ± 3% 0.18s ± 7% -85.76% (p=0.008 n=5+5)
LoadWAL/batches=10,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=2,mmappedChunkT=0-8 1.24s ± 2% 0.18s ± 5% -85.24% (p=0.008 n=5+5)
LoadWAL/batches=10,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=5,mmappedChunkT=0-8 1.23s ± 5% 0.27s ±33% -77.73% (p=0.008 n=5+5)
LoadWAL/batches=10,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=24,mmappedChunkT=0-8 1.28s ± 1% 0.36s ± 7% -71.51% (p=0.008 n=5+5)
LoadWAL/batches=100,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=0,mmappedChunkT=3800-8 12.1s ± 1% 3.1s ± 6% -74.33% (p=0.008 n=5+5)
LoadWAL/batches=100,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=2,mmappedChunkT=3800-8 12.1s ± 1% 3.4s ± 4% -71.94% (p=0.008 n=5+5)
LoadWAL/batches=100,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=5,mmappedChunkT=3800-8 12.1s ± 1% 3.8s ±17% -68.35% (p=0.008 n=5+5)
LoadWAL/batches=100,seriesPerBatch=1000,samplesPerSeries=480,exemplarsPerSeries=24,mmappedChunkT=3800-8 12.4s ± 1% 4.0s ±18% -67.71% (p=0.008 n=5+5)
```
Benchmarked on Linux
```
goos: linux
goarch: amd64
pkg: github.com/prometheus/prometheus/tsdb
cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
```
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>
* Avoid gaps in in-order data after restart with out-of-order enabled
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* Fix tests, do the temporary patch only if OOO is enabled
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* Avoid Peter's confusion
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* Use latest OutOfOrderTimeWindow
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
This implementation is based on this design doc:
https://docs.google.com/document/d/1Kppm7qL9C-BJB1j6yb6-9ObG3AbdZnFUBYPNNWwDBYM/edit?usp=sharing
This commit adds support to accept out-of-order ("OOO") sample into the TSDB
up to a configurable time allowance. If OOO is enabled, overlapping querying
are automatically enabled.
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
Co-authored-by: Jesus Vazquez <jesus.vazquez@grafana.com>
Co-authored-by: Ganesh Vernekar <ganeshvern@gmail.com>
Co-authored-by: Dieter Plaetinck <dieter@grafana.com>
The code sleeps for a short time to allow goroutines to finish, however
it seems the duration can be reduced a lot, speeding up the reading
process.
I checked using some WAL data from production, and the queue is almost
always empty at the time we enter `waitForIdle()` so there is no danger
of spinning in the tight loop.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* refactor: move from io/ioutil to io and os packages
* use fs.DirEntry instead of os.FileInfo after os.ReadDir
Signed-off-by: MOREL Matthieu <matthieu.morel@cnp.fr>
* track OOO-ness in tsdb into a histogram
This should help with building an understanding of customer requirements
as far as how far back samples tend to go.
Note: this has already been discussed, reviewed and approved on:
https://github.com/grafana/mimir/pull/488
Signed-off-by: Dieter Plaetinck <dieter@grafana.com>
* update the test files which were not in mimir/vendor
* Fix panic on WAL replay
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* Refactor: introduce walSubsetProcessor
walSubsetProcessor packages up the `processWALSamples()` function and
its input and output channels, helping to clarify how these things
relate.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* Refactor: extract more methods onto walSubsetProcessor
This makes the main logic easier to follow.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* Fix race warning by locking processWALSamples
Although we have waited for the processor to finish, we still get a
warning from the race detector because it doesn't know how the different
parts relate.
Add a lock round each batch of samples, so the race detector can see
that we never access series owned by the processor outside of a lock.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* Added test to reproduce issue 9859
Signed-off-by: Marco Pracucci <marco@pracucci.com>
* Remove redundant unit test
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* Fix out of order chunks during WAL replay
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* Fix nits
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
Co-authored-by: Ganesh Vernekar <ganeshvern@gmail.com>
Co-authored-by: Marco Pracucci <marco@pracucci.com>
This creates a new `model` directory and moves all data-model related
packages over there:
exemplar labels relabel rulefmt textparse timestamp value
All the others are more or less utilities and have been moved to `util`:
gate logging modetimevfs pool runtime
Signed-off-by: beorn7 <beorn@grafana.com>
* TSDB: demistify seriesRefs and ChunkRefs
The TSDB package contains many types of series and chunk references,
all shrouded in uint types. Often the same uint value may
actually mean one of different types, in non-obvious ways.
This PR aims to clarify the code and help navigating to relevant docs,
usage, etc much quicker.
Concretely:
* Use appropriately named types and document their semantics and
relations.
* Make multiplexing and demuxing of types explicit
(on the boundaries between concrete implementations and generic
interfaces).
* Casting between different types should be free. None of the changes
should have any impact on how the code runs.
TODO: Implement BlockSeriesRef where appropriate (for a future PR)
Signed-off-by: Dieter Plaetinck <dieter@grafana.com>
* feedback
Signed-off-by: Dieter Plaetinck <dieter@grafana.com>
* agent: demistify seriesRefs and ChunkRefs
Signed-off-by: Dieter Plaetinck <dieter@grafana.com>
* BenchmarkLoadWAL: close WAL after use
So that goroutines are stopped and resources released
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* BenchmarkLoadWAL: make series IDs co-prime with #workers
Series are distributed across workers by taking the modulus of the
ID with the number of workers, so multiples of 100 are a poor choice.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* BenchmarkLoadWAL: simulate mmapped chunks
Real Prometheus cuts chunks every 120 samples, then skips those samples
when re-reading the WAL. Simulate this by creating a single mapped chunk
for each series, since the max time is all the reader looks at.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* Fix comment
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* Remove series map from processWALSamples()
The locks that is commented to reduce contention in are now sharded
32,000 ways, so won't be contended. Removing the map saves memory and
goes just as fast.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* loadWAL: Cache the last mmapped chunk time
So we can skip calling append() for samples it will reject.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* Improvements from code review
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* Full stops and capitals on comments
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* Cache max time in both places mmappedChunks is updated
Including refactor to extract function `setMMappedChunks`, to reduce
code duplication.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* Update head min/max time when mmapped chunks added
This ensures we have the correct values if no WAL samples are added for
that series.
Note that `mSeries.maxTime()` was always `math.MinInt64` before, since
that function doesn't consider mmapped chunks.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* Avoid deadlock when processing duplicate series record
`processWALSamples()` needs to be able to send on its output channel
before it can read the input channel, so reads to allow this in case the
output channel is full.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* processWALSamples: update comment
Previous text seems to relate to an earlier implementation.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>