Commit graph

1068 commits

Author SHA1 Message Date
Annanay 89129cd39a Address comments
Signed-off-by: Annanay <annanayagarwal@gmail.com>
2020-07-30 16:41:13 +05:30
Annanay f40e4579b7 gofmt
Signed-off-by: Annanay <annanayagarwal@gmail.com>
2020-07-24 20:40:19 +05:30
Annanay 7f98a744e5 Add context to Appender interface
Signed-off-by: Annanay <annanayagarwal@gmail.com>
2020-07-24 19:40:51 +05:30
Zhou Hao ddedf454d0
add os.RemoveAll err verification (#7540)
* add os.RemoveAll err verification for watcher_test

Signed-off-by: Zhou Hao <zhouhao@cn.fujitsu.com>

* add os.RemoveAll err verification for db_test

Signed-off-by: Zhou Hao <zhouhao@cn.fujitsu.com>

* add os.RemoveAll err verification for write_test

Signed-off-by: Zhou Hao <zhouhao@cn.fujitsu.com>

* add os.RemoveAll err verification for queue_manager_test

Signed-off-by: Zhou Hao <zhouhao@cn.fujitsu.com>

* tsdb/wal/watcher_test: add close operation before delete

Signed-off-by: Zhou Hao <zhouhao@cn.fujitsu.com>
2020-07-17 11:47:32 +05:30
Hu Shuai b962029031
Add a unit test for MergeLabels in storage/remote/codec.go. (#7499)
This PR is about adding a unit test for MergeLabels in storage/remote/codec.go.

Signed-off-by: Hu Shuai <hus.fnst@cn.fujitsu.com>
2020-07-04 22:17:19 -06:00
Ganesh Vernekar a4c2ea1ca3
Merge remote-tracking branch 'upstream/master' into merge-release-2.19
Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
2020-06-26 14:33:50 +05:30
Chris Marchbanks b299aba6cf
Fix panic when updating a remote write queue (#7452)
Right now Queue Manager metrics are registered when the metrics struct
is created, which happens before a changed queue is shutdown and the old
metrics are unregistered. In the case of named queues or updates to
external labels the apply config will panic due to duplicate metrics.

Instead, register the metrics as part of starting the queue as we always
guarantee that Stop will be called before a new Start.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2020-06-26 12:03:52 +05:30
Chris Marchbanks d78656c244
Pending Samples metric includes samples in channel (#7335)
* Pending Samples metric includes samples in channel

The pending samples metric should also include samples waiting in the
channels to be sent to provide a more accurate measure. In addition,
make sure that the pending samples is reset to 0 anytime a queue is
started as we remake all of the shards at that time.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>

* Log the number of dropped samples on hard shutdown

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2020-06-25 14:48:30 -06:00
Bartlomiej Plotka b788986717
storage: Adjusted fully storage layer support for chunk iterators: Remote read client, readyStorage, fanout. (#7059)
* Fixed nits introduced by https://github.com/prometheus/prometheus/pull/7334
* Added ChunkQueryable implementation to fanout and readyStorage.
* Added more comments.
* Changed NewVerticalChunkSeriesMerger to CompactingChunkSeriesMerger, removed tiny interface by reusing VerticalSeriesMergeFunc for overlapping algorithm for
both chunks and series, for both querying and compacting (!) + made sure duplicates are merged.
* Added ErrChunkSeriesSet
* Added Samples interface for seamless []promb.Sample to []tsdbutil.Sample conversion.
* Deprecating non chunks serieset based StreamChunkedReadResponses, added chunk one.
* Improved tests.
* Split remote client into Write (old storage) and read.
* Queryable client is now SampleAndChunkQueryable. Since we cannot use nice QueryableFunc I moved
all config based options to sampleAndChunkQueryableClient to aboid boilerplate.

In next commit: Changes for TSDB.

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
2020-06-24 14:41:52 +01:00
Nicole J d5a8f2afc4
Added the remote read histogram (#7334)
change remote read queries total metric to a histogram and add read requests counter with status code

Signed-off-by: njingco <jingco.nicole@gmail.com>
2020-06-16 07:11:41 -07:00
Kemal Akkoyun 66dfb951c4
*: Consistent Error/Warning handling for SeriesSet iterator: Allowing Async Select (#7251)
* Add errors and Warnings to SeriesSet

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Change Querier interface and refactor accordingly

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Refactor promql/engine to propagate warnings at eval stage

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Address review issues

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Make sure all the series from all Selects are pre-advanced

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Address review issues

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Separate merge series sets

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Clean

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Refactor merge querier failure handling

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Refactored and simplified fanout with improvements from incoming chunk iterator PRs.

* Secondary logic is hidden, instead of weird failed series set logic we had.
* Fanout is well commented
* Fanout closing record all errors
* MergeQuerier improved API (clearer)
* deferredGenericMergeSeriesSet is not needed as we return no samples anyway for failed series sets (next = false).

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>

* Fix formatting

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Fix CI issues

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Added final tests for error handling.

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>

* Addressed Brian's comments.

* Moved hints in populate to be allocated only when needed.
* Used sync.Once in secondary Querier to achieve all-or-nothing partial response logic.
* Select after first Next is done will panic.

NOTE: in lazySeriesSet in theory we could just panic, I think however we can
totally just return error, it will panic in expand anyway.

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>

* Utilize errWithWarnings

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Fix recently introduced expansion issue

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Add tests for secondary querier error handling

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Implement lazy merge

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Add name to test cases

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Reorganize

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Address review comments

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Address review comments

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Remove redundant warnings

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Fix rebase mistake

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

Co-authored-by: Bartlomiej Plotka <bwplotka@gmail.com>
2020-06-09 17:57:31 +01:00
Bartlomiej Plotka 8f0a924be9 Merge branch 'release-2.18' into get2.18.2-to-master 2020-06-09 10:04:44 +01:00
Marco Pracucci bc1d7d7f27 Fixed incorrect query results caused by buffer reuse in merge adapter (#7361)
Signed-off-by: Marco Pracucci <marco@pracucci.com>
2020-06-09 05:45:18 +01:00
Marco Pracucci 1e006d84a0
Fixed incorrect query results caused by buffer reuse in merge adapter (#7361)
Signed-off-by: Marco Pracucci <marco@pracucci.com>
2020-06-08 19:55:53 +01:00
Bert Hartmann 82c7cd320a
increase the remote write bucket range (#7323)
* increase the remote write bucket range

Increase the range of remote write buckets to capture times above 10s for laggy scenarios
Buckets had been: {.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}
Buckets are now: {0.03125, 0.0625, 0.125, 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512}

Signed-off-by: Bert Hartmann <berthartm@gmail.com>

* revert back to DefBuckets with addons to be backwards compatible

Signed-off-by: Bert Hartmann <berthartm@gmail.com>

* shuffle the buckets to maintain 2-2.5x increases

Signed-off-by: Bert Hartmann <berthartm@gmail.com>
2020-06-04 13:54:47 -06:00
Nicole J 5e9bd17b1f
added the prometheus_remote_storage_remote_read_queries_total (#7328)
* added the prometheus_remote_storage_remote_read_queries_total query

Signed-off-by: njingco <jingco.nicole@gmail.com>

* adjusted the help label of remoteReadQueriesTotal

Signed-off-by: njingco <jingco.nicole@gmail.com>
2020-06-03 10:30:52 -07:00
Cody Boggs 3268eac2dd
Trace Remote Write requests (#7206)
* Trace Remote Write requests

Signed-off-by: Cody Boggs <cboggs@splunk.com>

* Refactor store attempts to keep code flow clearer, and avoid so many places to deal with span finishing

Signed-off-by: Cody Boggs <cboggs@splunk.com>
2020-06-01 09:21:13 -06:00
Chris Marchbanks c1f9917e90
Add test for unregistering queue manager metrics
Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2020-05-05 14:14:04 -06:00
Chris Marchbanks dfad1da296
Remove duplicate metrics in QueueManager
Right now any new metrics added for remote write need to be added to
both the QueueManager struct, and the queueManagerMetrics struct.
Instead, use the queueManagerMetrics struct directly from QueueManager.

The newQueueManagerMetrics constructor will now create the metrics for a
specific queue with name and endpoint pre-populated, and a new copy of
the struct will be created specifically for each queue.

This also fixes a bug where prometheus_remote_storage_sent_bytes_total
is not being unregistered after a queue is changed.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2020-05-05 14:13:59 -06:00
Julien Pivotto 7ecd2d1c24
Jaeger: Create child span for remote read (#7187)
* Jaeger: Create child span for remote read
* Jaeger: use middleware to trace client http request

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2020-05-02 22:41:55 +02:00
qinng f36ae1c21c
[remote-storage] use warn log level when send samples to remote failed (#7184)
[remote] increasing sendbatch error log level

Signed-off-by: guoruyi1 <guoruyi1@xiaomi.com>
Co-authored-by: guoruyi1 <guoruyi1@xiaomi.com>
2020-04-30 17:06:22 -06:00
Vasily Sliouniaev 0393b188c9
Add Jaeger (#7148)
* Trace remote read

Signed-off-by: vas <vasily.sliouniaev@jet.com>

* Use jaeger

Signed-off-by: vas <vasily.sliouniaev@jet.com>
2020-04-23 02:05:55 +02:00
Marek Slabicki 4b5e7d4984
Adding a shouldReshard function to modularize logic for the QueueManager deciding if it should shard or not (#7143)
Signed-off-by: Marek Slabicki <thaniri@gmail.com>
2020-04-20 16:20:39 -06:00
Chris Marchbanks cd12f0873c
Merge pull request #7073 from csmarchbanks/fix-md5-remote-write
Fix remote write not updating when relabel configs or secrets change
2020-04-16 16:36:25 -06:00
Chris Marchbanks 5ab6b043c1
Always update lastSendTimestamp after a request (#7122)
If the server is returning non-recoverable errors, such as if we are
trying to push samples that are too old, remote write will never
reshard. Non-recoverable errors should be treated the same as success
for the purpose of resharding, just as we do with sample rates and
durations.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2020-04-15 09:03:28 -06:00
ZouYu 2b7437d60e
Fix some warnings: 'redundant type from array, slice, or map composite literal' (#7109)
Signed-off-by: ZouYu <zouy.fnst@cn.fujitsu.com>
2020-04-15 11:17:41 +01:00
Chris Marchbanks d88a2b0261 Handle secret changes in remote write ApplyConfig
Remake the http client whenever ApplyConfig is called. This allows
secrets to be updated without needing to restart an otherwise unchanged
queue.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2020-04-13 23:14:15 +00:00
Simon Pasquier 317e73de79 Hash YAML instead of JSON
But it doesn't work either because of secret fields.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2020-04-13 22:32:37 +00:00
Simon Pasquier 8cc84660fa storage/remote: add tests for config changes
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2020-04-13 22:32:37 +00:00
Marek Slabicki 8224ddec23
Capitalizing first letter of all log lines (#7043)
Signed-off-by: Marek Slabicki <thaniri@gmail.com>
2020-04-11 09:22:18 +01:00
ga 9a21fdcd1b
[storage] clean imports (#7099)
Signed-off-by: Gaurav Singh <gaurav1086@gmail.com>
2020-04-07 22:05:39 +01:00
Callum Styan be13a4ba7e
Compare querier storage to primary storage via reflect.DeepEqual. (#7050)
Signed-off-by: Callum Styan <callumstyan@gmail.com>
2020-03-25 18:42:32 +00:00
Bartlomiej Plotka d5c33877f9
storage: Added Chunks{Queryable/Querier/SeriesSet/Series/Iteratable. Added generic Merge{SeriesSet/Querier} implementation. (#7005)
* storage: Added Chunks{Queryable/Querier/SeriesSet/Series/Iteratable. Added generic Merge{SeriesSet/Querier} implementation.

## Rationales:

In many places (e.g. chunk Remote read, Thanos Receive fetching chunk from TSDB), we operate on encoded chunks not samples.
This means that we unnecessary decode/encode, wasting CPU, time and memory.
This PR adds chunk iterator interfaces and makes the merge code to be reused between both seriesSets

I will make the use of it in following PR inside tsdb itself. For now fanout implements it and mergers.

All merges now also allows passing series mergers. This opens doors for custom deduplications other than TSDB vertical ones (e.g. offline one we have in Thanos).

## Changes

* Added Chunk versions of all iterating methods. It all starts in Querier/ChunkQuerier. The plan is that
Storage will implement both chunked and samples.
* Added Seek to chunks.Iterator interface for iterating over chunks.
* NewMergeChunkQuerier was added; Both this and NewMergeQuerier are now using generigMergeQuerier to share the code. Generic code was added.
* Improved tests.
* Added some TODO for further simplifications in next PRs.

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>

* Addressed Brian's comments.

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>

* Moved s/Labeled/SeriesLabels as per Krasi suggestion.

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>

* Addressed Krasi's comments.

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>

* Second iteration of Krasi comments.

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>

* Another round of comments.

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
2020-03-24 20:15:47 +00:00
beorn7 526cff39b9 Fix tests that were broken by #7009
Signed-off-by: beorn7 <beorn@grafana.com>
2020-03-20 21:22:58 +01:00
Bartlomiej Plotka c4eefd1b3a storage: Removed SelectSorted method; Simplified interface; Added requirement for remote read to sort response.
This is technically BREAKING CHANGE, but it was like this from the beginning: I just notice that we rely in
Prometheus on remote read being sorted. This is because we use selected data from remote reads in MergeSeriesSet
which rely on sorting.

I found during work on https://github.com/prometheus/prometheus/pull/5882 that
we do so many repetitions because of this, for not good reason. I think
I found a good balance between convenience and readability with just one method.
Smaller the interface = better.

Also I don't know what TestSelectSorted was testing, but now it's testing sorting.

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
2020-03-20 21:14:43 +01:00
Callum Styan f802f1e8ca
Fix bug with WAL watcher and Live Reader metrics usage. (#6998)
* Fix bug with WAL watcher and Live Reader metrics usage.

Calling NewXMetrics when creating a Watcher or LiveReader results in a
registration error, which we're ignoring, and as a result other than the
first Watcher/Reader created, we had no metrics for either. So we would
only have metrics like Watcher Records Read for the first remote write
config in a users config file.

Signed-off-by: Callum Styan <callumstyan@gmail.com>
2020-03-20 17:34:15 +01:00
Chris Marchbanks 3128875ff4
Fix panic when a remote read store errors (#6975)
Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2020-03-13 20:57:29 +01:00
beorn7 f6f4fd6556 tsdb: Do a full rollback upon commit error
I think the previous behavior is problematic as it will leave
`memSeries` around that still have `pendingCommit` set to `true`.

The only case where this can happen in this code path is a failure to
write to the WAL, in which case we are probably in trouble anyway. I
believe, however, we should still try to do the right thing and do the
full rollback. This will implicitly try to write to the WAL again, but
this time without samples, which may even succeed. (But we propagate
the previous error in any case.)

This also adds `a.head.putSeriesBuffer(a.sampleSeries)` to Rollback,
which was previously missing.

Signed-off-by: beorn7 <beorn@grafana.com>
2020-03-10 14:54:41 +01:00
Callum Styan 1518083168
Rw testability improvements (#6537)
* Change createTimeseries to take values for number of series and number
of samples per series.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Take num of samples to expect in expectSampleCount instead of array of
samples.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Add field to TestStorageClient to ignore samples sent waitgroup for
potential tests where we don't care about delivery of all samples.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Fix up tests a little bit.

Signed-off-by: Callum Styan <callumstyan@gmail.com>
2020-02-25 11:10:57 -08:00
李国忠 f7a322fa6d
[comments] change the word "liike" to "like" (#6859)
Signed-off-by: fuling <fuling.lgz@alibaba-inc.com>
2020-02-22 20:48:54 +00:00
Bartlomiej Plotka 59c9d6ef45 Addressed Brian's comments, moved metrics to main.go
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
2020-02-17 18:03:57 +00:00
Bartlomiej Plotka cfba92a133 Addressed comments.
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
2020-02-17 18:03:57 +00:00
Bartlomiej Plotka fb79f515fc Fixed second bug.
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
2020-02-17 18:03:57 +00:00
Bartlomiej Plotka 2cf637fbf5 Addressed comments.
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
2020-02-17 18:03:57 +00:00
Bartlomiej Plotka 34426766d8 Unify Iterator interfaces. All point to storage now.
This is part of https://github.com/prometheus/prometheus/pull/5882 that can be done to simplify things.
All todos I added will be fixed in follow up PRs.

* querier.Querier, querier.Appender, querier.SeriesSet, and querier.Series interfaces merged
with storage interface.go. All imports that.
* querier.SeriesIterator replaced by chunkenc.Iterator
* Added chunkenc.Iterator.Seek method and tests for xor implementation (?)
* Since we properly handle SelectParams for Select methods I adjusted min max
based on that. This should help in terms of performance for queries with functions like offset.
* added Seek to deletedIterator and test.
* storage/tsdb was removed as it was only a unnecessary glue with incompatible structs.

No logic was changed, only different source of abstractions, so no need for benchmarks.

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
2020-02-17 18:03:54 +00:00
李国忠 3cd6a5b050
Storage concurrently tests and bug fix (#6808)
* Storage concurrently tests and bug fix

Signed-off-by: fuling <fuling.lgz@alibaba-inc.com>
2020-02-12 08:58:52 +00:00
李国忠 40dd13b074
Storage concurrently (#6770)
* Storage concurrently
Signed-off-by: fuling <fuling.lgz@alibaba-inc.com>
2020-02-11 17:19:34 +00:00
helenxu1221 7df4fe3faa reset counter after collecting metric (#6798)
Signed-off-by: HelenXu <helenxu@Helens-MacBook-Pro.local>
2020-02-09 20:51:21 -07:00
Robert Fratto a53e00f9fd
pass registerer from storage to queue manager for its metrics (#6728)
* pass registerer from storage to queue manager for its metrics

Signed-off-by: Robert Fratto <robert.fratto@grafana.com>
2020-02-03 13:47:03 -08:00
Peter Štibraný 08c5549055
Document that NewMergeSeriesSet expects individual sets to be sorted. (#6718)
Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
2020-01-30 19:30:32 +05:30