* Write exemplars to the WAL and send them over remote write.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Update example for exemplars, print data in a more obvious format.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Add metrics for remote write of exemplars.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Fix incorrect slices passed to send in remote write.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* We need to unregister the new metrics.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Address review comments
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Order of exemplar append vs write exemplar to WAL needs to change.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Several fixes to prevent sending uninitialized or incorrect samples with an exemplar. Fix dropping exemplar for missing series. Add tests for queue_manager sending exemplars
Signed-off-by: Martin Disibio <mdisibio@gmail.com>
* Store both samples and exemplars in the same timeseries buffer to remove the alloc when building final request, keep sub-slices in separate buffers for re-use
Signed-off-by: Martin Disibio <mdisibio@gmail.com>
* Condense sample/exemplar delivery tests to parameterized sub-tests
Signed-off-by: Martin Disibio <mdisibio@gmail.com>
* Rename test methods for clarity now that they also handle exemplars
Signed-off-by: Martin Disibio <mdisibio@gmail.com>
* Rename counter variable. Fix instances where metrics were not updated correctly
Signed-off-by: Martin Disibio <mdisibio@gmail.com>
* Add exemplars to LoadWAL benchmark
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* last exemplars timestamp metric needs to convert value to seconds with
ms precision
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Process exemplar records in a separate go routine when loading the WAL.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Address review comments related to clarifying comments and variable
names. Also refactor sample/exemplar to enqueue prompb types.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Regenerate types proto with comments, update protoc version again.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Put remote write of exemplars behind a feature flag.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Address some of Ganesh's review comments.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Move exemplar remote write feature flag to a config file field.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Address Bartek's review comments.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Don't allocate exemplar buffers in queue_manager if we're not going to
send exemplars over remote write.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Add ValidateExemplar function, validate exemplars when appending to head
and log them all to WAL before adding them to exemplar storage.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Address more reivew comments from Ganesh.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Add exemplar total label length check.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Address a few last review comments
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Co-authored-by: Martin Disibio <mdisibio@gmail.com>
* scrape: add label limits per scrape
Add three new limits to the scrape configuration to provide some
mechanism to defend against unbound number of labels and excessive
label lengths. If any of these limits are broken by a sample from a
scrape, the whole scrape will fail. For all of these configuration
options, a zero value means no limit.
The `label_limit` configuration will provide a mechanism to bound the
number of labels per-scrape of a certain sample to a user defined limit.
This limit will be tested against the sample labels plus the discovery
labels, but it will exclude the __name__ from the count since it is a
mandatory Prometheus label to which applying constraints isn't
meaningful.
The `label_name_length_limit` and `label_value_length_limit` will
prevent having labels of excessive lengths. These limits also skip the
__name__ label for the same reasons as the `label_limit` option and will
also make the scrape fail if any sample has a label name/value length
that exceed the predefined limits.
Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
* scrape: add metrics and alert to label limits
Add three gauge, one for each label limit to easily access the
limit set by a certain scrape target.
Also add a counter to count the number of targets that exceeded the
label limits and thus were dropped. This is useful for the
`PrometheusLabelLimitHit` alert that will notify the users that scraping
some targets failed because they had samples exceeding the label limits
defined in the scrape configuration.
Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
* scrape: apply label limits to __name__ label
Apply limits to the __name__ label that was previously skipped and
truncate the label names and values in the error messages as they can be
very very long.
Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
* scrape: remove label limits gauges and refactor
Remove `prometheus_target_scrape_pool_label_limit`,
`prometheus_target_scrape_pool_label_name_length_limit`, and
`prometheus_target_scrape_pool_label_value_length_limit` as they are not
really useful since we don't have the information on the labels in it.
Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
This moves the label lookup into TSDB, whilst still keeping the cached-ref optimisation for repeated Appends.
This makes the API easier to consume and implement. In particular this change is motivated by the scrape-time-aggregation work, which I don't think is possible to implement without it as it needs access to label values.
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
Manager.reload takes the mutex that would make it safe, however
releases it before the goroutines spawned are finished with it.
Thus more explicit locking of scrapePool.Sync/stop/reload is needed.
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Testify: move to require
Moving testify to require to fail tests early in case of errors.
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
* More moves
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
* Refactor test assertions
This pull request gets rid of assert.True where possible to use
fine-grained assertions.
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
Don't lock for all of Sync/stop/reload as that holds up /metrics and the
UI when they want a list of active/dropped targets. Instead take
advantage of the fact that Sync/stop/reload cannot be called
concurrently by the scrape Manager and lock just on the targets
themselves.
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
This PR test that de-duplicated targets are actually started.
It is a unit test for this line of code:
072b9649a3/scrape/scrape.go (L457)
which is working and necessary but was not tested yet.
It also tests that scrapes are started in the normal way, in the targets
limit test.
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
This also fixes a bug in query_log_file, which now is relative to the config file like all other paths.
Signed-off-by: Andy Bursavich <abursavich@gmail.com>
When I started wotking on target_limit, scrapeAndReport did not exist
yet. Then I simply rebased my work without thinking.
It appears that there is a lot that can be inline if I defer() the
report.
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
With this, the storage tests inside the scrape package are more
realistic.
Discovered with #7593, but fixed independently as #7593 will probably
take some time.
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
* Add errors and Warnings to SeriesSet
Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
* Change Querier interface and refactor accordingly
Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
* Refactor promql/engine to propagate warnings at eval stage
Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
* Address review issues
Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
* Make sure all the series from all Selects are pre-advanced
Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
* Address review issues
Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
* Separate merge series sets
Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
* Clean
Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
* Refactor merge querier failure handling
Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
* Refactored and simplified fanout with improvements from incoming chunk iterator PRs.
* Secondary logic is hidden, instead of weird failed series set logic we had.
* Fanout is well commented
* Fanout closing record all errors
* MergeQuerier improved API (clearer)
* deferredGenericMergeSeriesSet is not needed as we return no samples anyway for failed series sets (next = false).
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
* Fix formatting
Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
* Fix CI issues
Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
* Added final tests for error handling.
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
* Addressed Brian's comments.
* Moved hints in populate to be allocated only when needed.
* Used sync.Once in secondary Querier to achieve all-or-nothing partial response logic.
* Select after first Next is done will panic.
NOTE: in lazySeriesSet in theory we could just panic, I think however we can
totally just return error, it will panic in expand anyway.
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
* Utilize errWithWarnings
Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
* Fix recently introduced expansion issue
Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
* Add tests for secondary querier error handling
Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
* Implement lazy merge
Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
* Add name to test cases
Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
* Reorganize
Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
* Address review comments
Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
* Address review comments
Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
* Remove redundant warnings
Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
* Fix rebase mistake
Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
Co-authored-by: Bartlomiej Plotka <bwplotka@gmail.com>
* Separate scrape add error checking out into it's own function.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* pass sampleLimitError to checkAddError instead of returning an error
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Return bool, error from checkAddError so we can properly handle
ErrNotFound for AddFast. This should in theory never happen, but the
previous code path handled this case. Adds a test for this, which master
passes and the previous commit fails.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Address comment changes.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Move sampleAdded inside the loop iteration within append, since that's
the only block the variable is used in.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
This is technically BREAKING CHANGE, but it was like this from the beginning: I just notice that we rely in
Prometheus on remote read being sorted. This is because we use selected data from remote reads in MergeSeriesSet
which rely on sorting.
I found during work on https://github.com/prometheus/prometheus/pull/5882 that
we do so many repetitions because of this, for not good reason. I think
I found a good balance between convenience and readability with just one method.
Smaller the interface = better.
Also I don't know what TestSelectSorted was testing, but now it's testing sorting.
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
This fixes#6992, which was introduced by #6777. There was an
intermediate component which translated TSDB errors into storage errors,
but that component was deleted and this bug went unnoticed, until we
were watching at the Prombench results. Without this, scrape will fail
instead of dropping samples or using "Add" when the series have been
garbage collected.
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
With defer having less of a performance penalty, there is no reason
not to do those crucial operations via defer.
Context: With isolation in place, if we forget to Commit/Rollback, the
low watermark will get stuck forever.
The current code should not have any bugs, but moving to defer helps
to avoid future bugs.
This is also moving the `closeAppend` in the `Commit` implementation
itself to defer. If logging to the WAL fails, we would have missed the
`closeAppend`.
Signed-off-by: beorn7 <beorn@grafana.com>
This is most likely due to an endpoint not producing valid
metrics output, which we should treat the same as a failed
scrape, and thus not spam the application logs with it.
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>