use it in loadDataAsQueryable to make sure the RO Head doesn't truncate or cut new chunks in data/chunks_head/.
add a -sandbox-dir-root flag to "promtool tsdb dump/dump-openmetrics" to control the root of that sandbox dirrectory.
Signed-off-by: machine424 <ayoubmrini424@gmail.com>
Thanos can create and destroy TSDBs dynamically, and once a TSDB
disappears its files are deleted. Calculating the size of the
WAL then fails with errors like:
```
msg: "Failed to calculate size of "wal" dir", "err": "lstat
/tsdbdir/wal: no such file or directory", "caller": "wlog.go:271"
```
Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>
Signed-off-by: Jonathan Halterman <jonathan@grafana.com>
Signed-off-by: Jonathan Halterman <jhalterman@gmail.com>
Co-authored-by: Jesus Vazquez <jesusvazquez@users.noreply.github.com>
* Stop compactions if there's a block to write
db.Compact() checks if there's a block to write with HEAD chunks before calling db.compactBlocks().
This is to ensure that if we need to write a block then it happens ASAP, otherwise memory usage might keep growing.
But what can also happen is that we don't need to write any block, we start db.compactBlocks(),
compaction takes hours, and in the meantime HEAD needs to write out chunks to a block.
This can be especially problematic if, for example, you run Thanos sidecar that's uploading block,
which requires that compactions are disabled. Then you disable Thanos sidecar and re-enable compactions.
When db.compactBlocks() is finally called it might have a huge number of blocks to compact, which might
take a very long time, during which HEAD cannot write out chunks to a new block.
In such case memory usage will keep growing until either:
- compactions are finally finished and HEAD can write a block
- we run out of memory and Prometheus gets OOM-killed
This change adds a check for pending HEAD block writes inside db.compactBlocks(), so that
we bail out early if there are still compactions to run, but we also need to write a new
block.
Also add a test for compactBlocks.
---------
Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
Signed-off-by: Lukasz Mierzwa <lukasz@cloudflare.com>
* Use single bit to differentiate between optimized bounds and floats
Use one bit to decide what kind of data to read/write.
This reduces storage need of floats from 72 bits to 65 bits and makes the
integers store in 5 to 32 bits instead of 16.
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Signed-off-by: Jeanette Tan <jeanette.tan@grafana.com>
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Signed-off-by: Jeanette Tan <jeanette.tan@grafana.com>
Signed-off-by: George Krajcsovits <krajorama@users.noreply.github.com>
Co-authored-by: Jeanette Tan <jeanette.tan@grafana.com>
* TSDB: Don't compact the head block when empty
Don't compact the Head block if there have not yet been any samples
appended.
Previously, the logic for determining if the head should be compacted
relied on the default values for min and max time and integer overflow
when they were checked in `Head.compactable()`. The check in
`Head.compactable()` effectively did `math.MinInt64 - math.MaxInt64`
which overflowed and wrapped to `1`. Since `1` is less than `1.5`
times the chunk range, compaction did not happen. This was the correct
behavior but relying on overflow wrapping is surprising.
This change add a method for checking if the min and max time for the
head is unset and uses it to short-circuit compaction in that case.
It also replaces several explicit checks for the default value to
determine if the head has not yet had any samples added.
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
* tsdb: zero out Labels and memSeries pointers in pool
So that the garbage-collector doesn't see this memory as still in use.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
---------
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Signed-off-by: Björn Rabenstein <github@rabenste.in>
Co-authored-by: Björn Rabenstein <github@rabenste.in>
Dogfood native histograms.
Allow dependent projects to migrate to native histograms.
I took the defaults from client_golang.
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Clarify in the first comment that it is `watch()` that waits, and reduce
verbiage.
The second comment was slightly contradictory to the first and otherwise
didn't seem to add much, since `currentSegment` was incremented just a
few lines later.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* fix bug that would cause us to only read from the WAL on the 15s
fallback timer if remote write had fallen behind and is no longer
reading from the WAL segment that is currently being written to
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* remove unintended logging, fix lint, plus allow test to take slightly
longer because cloud CI
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* address review feedback
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* fix watcher sleeps in test, flu brain is smooth
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* increase timeout, unfortunately cloud CI can require a longer timeout
Signed-off-by: Callum Styan <callumstyan@gmail.com>
---------
Signed-off-by: Callum Styan <callumstyan@gmail.com>
For instance `require.NoError` will print the unexpected error; we don't
need to include it in the message.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
'defer' runs at the end of the whole function; we should close each
segment file as soon as we finished reading it.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
This PR is a reference implementation of the proposal described in #10420.
In addition to what described in #10420, in this PR I've introduced labels.StableHash(). The idea is to offer an hashing function which doesn't change over time, and that's used by query sharding in order to get a stable behaviour over time. The implementation of labels.StableHash() is the hashing function used by Prometheus before stringlabels, and what's used by Grafana Mimir for query sharding (because built before stringlabels was a thing).
Follow up work
As mentioned in #10420, if this PR is accepted I'm also open to upload another foundamental piece used by Grafana Mimir query sharding to accelerate the query execution: an optional, configurable and fast in-memory cache for the series hashes.
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Mutex is 8 bytes; RWMutex is 24 bytes and much more complicated. Since
`RLock` is only used in two places, `UpdateMetadata` and `Delete`,
neither of which are hotspots, we should use the cheaper one.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Lifting an optimisation from Agent code, `seriesHashmap.del` can use
the unique series reference, doesn't need to check Labels.
Also streamline the logic for deleting from `unique` and `conflicts` maps,
and add some comments to help the next person.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Add test case to TestHeadLabelValuesWithMatchers, while fixing a couple
of typos in other test cases. Also enclosing some implicit sub-tests in a
`t.Run` call to make them explicitly sub-tests.
Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
Optimize histogram iterators
Histogram iterators allocate new objects in the AtHistogram and
AtFloatHistogram methods, which makes calculating rates over long
ranges expensive.
In #13215 we allowed an existing object to be reused
when converting an integer histogram to a float histogram. This commit follows
the same idea and allows injecting an existing object in the AtHistogram and
AtFloatHistogram methods. When the injected value is nil, iterators allocate
new histograms, otherwise they populate and return the injected object.
The commit also adds a CopyTo method to Histogram and FloatHistogram which
is used in the BufferedIterator to overwrite items in the ring instead of making
new copies.
Note that a specialized HPoint pool is needed for all of this to work
(`matrixSelectorHPool`).
---------
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
Co-authored-by: George Krajcsovits <krajorama@users.noreply.github.com>
Don't calculate postings beforehand: we may not need them. If all
matchers are for the requested label, we can just filter its values.
Also, if there are no values at all, no need to run any kind of
logic.
Also add more labelValuesWithMatchers benchmarks
Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
This functionality is needed in downstream projects because they have a
separate component that does compaction.
Upstreaming
7c8e9a2a76/tsdb/compact.go (L323-L325).
Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>
* tsdb/{index,compact}: allow using custom postings encoding format
We would like to experiment with a different postings encoding format in
Thanos so in this change I am proposing adding another argument to
`NewWriter` which would allow users to change the format if needed.
Also, wire the leveled compactor so that it would be possible to change
the format there too.
Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>
* tsdb/compact: use a struct for leveled compactor options
As discussed on Slack, let's use a struct for the options in leveled
compactor.
Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>
* tsdb: make changes after Bryan's review
- Make changes less intrusive
- Turn the postings encoder type into a function
- Add NewWriterWithEncoder()
Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>
---------
Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>
This reverts commit 2ddb3596ef.
Various tests are failing in CI after this change; reverting to free up
other work.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
They are used in multiple repos, so common is a better place for them.
Several packages now don't depend on `model/textparse`, e.g.
`storage/remote`.
Also remove `metadata` struct from `api.go`, since it was identical to
a struct in the `metadata` package.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
In https://github.com/prometheus/prometheus/pull/13276 we started reusing float histogram objects to reduce allocations in PromQL.
That PR introduces a bug where histogram pointers gets copied to the beginning of the histograms slice,
but are still kept in the end of the slice. When a new histogram is read into the last element,
it can overwrite a previous element because the pointer is the same.
This commit fixes the issue by moving outdated points to the end of the slice
so that we don't end up with duplicate pointers in the same buffer. In other words,
the slice gets rotated so that old objects can get reused.
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
Digging around the TSDB code and I've found that this flag is unused so
let's remove it.
Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>
* Append created timestamps.
Signed-off-by: Arthur Silva Sens <arthur.sens@coralogix.com>
* Log when created timestamps are ignored
Signed-off-by: Arthur Silva Sens <arthur.sens@coralogix.com>
* Proposed changes to Append CT PR.
Changes:
* Changed textparse Parser interface for consistency and robustness.
* Changed CT interface to be more explicit and handle validation.
* Simplified test, change scrapeManager to allow testability.
* Added TODOs.
Signed-off-by: bwplotka <bwplotka@gmail.com>
* Updates.
Signed-off-by: bwplotka <bwplotka@gmail.com>
* Addressed comments.
Signed-off-by: bwplotka <bwplotka@gmail.com>
* Refactor head_appender test
Signed-off-by: Arthur Silva Sens <arthur.sens@coralogix.com>
* Fix linter issues
Signed-off-by: Arthur Silva Sens <arthur.sens@coralogix.com>
* Use model.Sample in head appender test
Signed-off-by: Arthur Silva Sens <arthur.sens@coralogix.com>
---------
Signed-off-by: Arthur Silva Sens <arthur.sens@coralogix.com>
Signed-off-by: bwplotka <bwplotka@gmail.com>
Co-authored-by: bwplotka <bwplotka@gmail.com>
It's faster.
Note change to test - instead of requiring that the data structure is
identical to `EmptyPostings()`, check that calling `Next()` returns
false, which implies it was empty.
Also the check for context cancellation during initialization was
removed. Initialization should be a small portion of the work done
during merge, so it's not worth plumbing a context argument through.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
The 'ToFloat' method on integer histograms currently allocates new memory
each time it is called.
This commit adds an optional *FloatHistogram parameter that can be used
to reuse span and bucket slices. It is up to the caller to make sure the
input float histogram is not used anymore after the call.
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
* Fix histogram append errors
We should check counterReset condition rather than okToAppend because if
there's a counter reset, okToAppend is always set to false.
Signed-off-by: Fiona Liao <fiona.y.liao@gmail.com>
When no samples are returned in a chunk because all the samples have
been deleted, the chunk iterator then stops without iterating through
any remaining chunks.
Signed-off-by: Fiona Liao <fiona.y.liao@gmail.com>
* Fix tsdb.stripeSeries.gc so it handles conflicts properly
tsdb.stripeSeries.gc needs to prune seriesHashmap.conflicts first,
otherwise seriesHashmap replaces the unique field with the first among
the conflicts. Also add regression test.
Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
* TestStripeSeries_gc: Support stringlabels, don't use internals
Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
---------
Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
The ChunkReader interface's Chunk() has been changed to ChunkOrIterable().
This is a precursor to OOO native histogram support - with OOO native histograms, the chunks.Meta passed to Chunk() can result in multiple chunks being returned rather than just a single chunk (e.g. if oooMergedChunk has a counter reset in the middle).
To support this, ChunkOrIterable() requires either a single chunk or an iterable to be returned. If an iterable is returned, the caller has the responsibility of converting the samples from the iterable into possibly multiple chunks. The OOOHeadChunkReader now returns an iterable rather than a chunk to prepare for the native histograms case. Also as a beneficial side effect, oooMergedChunk and boundedChunk has been simplified as they only need to implement the Iterable interface now, not the full Chunk interface.
---------
Signed-off-by: Fiona Liao <fiona.y.liao@gmail.com>
Co-authored-by: George Krajcsovits <krajorama@users.noreply.github.com>
* Fix tsdb.seriesHashmap.set by making receiver a pointer
The method tsdb.seriesHashmap.set currently doesn't set the conflicts
field properly, due to the receiver being a non-pointer. Fix by turning
the receiver into a pointer, and add a corresponding regression test.
Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
* Add failing test.
Signed-off-by: Charles Korn <charles.korn@grafana.com>
* Don't run OOO head garbage collection while reads are running.
Signed-off-by: Charles Korn <charles.korn@grafana.com>
* Add further test cases for different order of operations.
Signed-off-by: Charles Korn <charles.korn@grafana.com>
* Ensure all queriers are closed if `DB.blockChunkQuerierForRange()` fails.
Signed-off-by: Charles Korn <charles.korn@grafana.com>
* Ensure all queriers are closed if `DB.Querier()` fails.
Signed-off-by: Charles Korn <charles.korn@grafana.com>
* Invert error handling in `DB.Querier()` and `DB.blockChunkQuerierForRange()` to make it clearer
Signed-off-by: Charles Korn <charles.korn@grafana.com>
* Ensure that queries that touch OOO data can't block OOO head garbage collection forever.
Signed-off-by: Charles Korn <charles.korn@grafana.com>
* Address PR feedback: fix parameter name in comment
Co-authored-by: Jesus Vazquez <jesusvazquez@users.noreply.github.com>
Signed-off-by: Charles Korn <charleskorn@users.noreply.github.com>
* Address PR feedback: use `lastGarbageCollectedMmapRef`
Signed-off-by: Charles Korn <charles.korn@grafana.com>
* Address PR feedback: ensure pending reads are cleaned up if creating an OOO querier fails
Signed-off-by: Charles Korn <charles.korn@grafana.com>
---------
Signed-off-by: Charles Korn <charles.korn@grafana.com>
Signed-off-by: Charles Korn <charleskorn@users.noreply.github.com>
Co-authored-by: Jesus Vazquez <jesusvazquez@users.noreply.github.com>
* Make head block ULIDs descriptive
As far as I understand, these ULIDs aren't persisted anywhere, so it
should be safe to change them.
When debugging an issue, seeing an ULID like
`2ZBXFNYVVFDXFPGSB1CHFNYQTZ` or `33DXR7JA39CHDKMQ9C40H6YVVF` isn't very
helpful, so I propose to make them readable in their ULID string
version.
Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
* Set a different ULID for RangeHead
Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
---------
Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
Fix and improve ingesting exemplars for native histograms.
See code comment for a detailed explanation of the algorithm.
Note that this changes the current behavior for all kind of samples slightly: We now allow exemplars with the same timestamp as during the last scrape if the value or the labels have changed.
Also note that we now do not ingest exemplars without timestamps for native histograms anymore.
Signed-off-by: Jeanette Tan <jeanette.tan@grafana.com>
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Co-authored-by: Björn Rabenstein <github@rabenste.in>
---------
Signed-off-by: Jeanette Tan <jeanette.tan@grafana.com>
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Signed-off-by: zenador <zenador@users.noreply.github.com>
Co-authored-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Co-authored-by: Björn Rabenstein <github@rabenste.in>
* Labels: reduce allocations when creating from TSDB
When reading the WAL, by passing references into the buffer we can avoid
copying strings under `-tags stringlabels`.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* Fix panic during tsdb Commit
Fixes the following
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x19deb45]
goroutine 651118930 [running]:
github.com/prometheus/prometheus/tsdb.(*headAppender).Commit(0xc19100f7c0)
/drone/src/vendor/github.com/prometheus/prometheus/tsdb/head_append.go:855 +0x245
github.com/prometheus/prometheus/tsdb.dbAppender.Commit({{0x35bd6f0?, 0xc19100f7c0?}, 0xc000fa4c00?})
/drone/src/vendor/github.com/prometheus/prometheus/tsdb/db.go:1159 +0x2f
We theorize that the panic happened due the the series referenced by the
exemplar being removed between AppendExemplar and Commit due to being idle.
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
When samples are committed in the head, they are also written to the WAL.
The order of WAL records should be sample then exemplar, but this was
not the case for native histogram samples. This PR fixes that.
The problem with the wrong order is that remote write reads the WAL and
sends the recorded timeseries in the WAL order, which means exemplars
arrived before histogram samples. If the receiving side is Prometheus
TSDB and the series has not existed before then the exemplar does not
currently create the series. Which means the exemplar is rejected and lost.
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Instead of a map of slices of `*memSeries`, ready for any of them to
hold series where hash values collide, split into a map of `*memSeries`
and a map of slices which is usually empty, since hash collisions are
a one-in-a-billion thing.
The `del` method gets more complicated, to maintain the invariant that
a series is only in one of the two maps.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Extract the middle of the loop into a function, so it will be
easier to modify the `seriesHashmap` data structure.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>