5512 commits

Author SHA1 Message Date
Callum Styan 6f69e31398 Tail the TSDB WAL for remote_write
This change switches the remote_write API to use the TSDB WAL.  This should reduce memory usage and prevent sample loss when the remote endpoint is down.

We use the new LiveReader from TSDB to tail WAL segments.  Logic for finding the tracking segment is included in this PR.  The WAL is tailed once for each remote_write endpoint specified. Reading from the segment is based on a ticker rather than relying on fsnotify write events, which were found to be complicated and unreliable in early prototypes.
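
A minimal sketch of the ticker-based tailing described above. It does not use the real TSDB LiveReader: `segment`, `tailer`, `poll` and `run` are hypothetical names, and an in-memory byte slice stands in for a WAL segment file. The point is only that the reader checks for new data on each tick instead of waiting for filesystem notifications.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// segment is a hypothetical in-memory stand-in for a WAL segment file
// that a writer keeps appending to.
type segment struct {
	mu   sync.Mutex
	data []byte
}

// write appends b to the end of the segment.
func (s *segment) write(b []byte) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data = append(s.data, b...)
}

// readFrom returns a copy of everything written at or after offset off.
func (s *segment) readFrom(off int) []byte {
	s.mu.Lock()
	defer s.mu.Unlock()
	return append([]byte(nil), s.data[off:]...)
}

// tailer remembers how far into the segment it has already read.
type tailer struct {
	seg *segment
	off int
}

// poll returns the bytes appended since the previous poll (empty if none).
func (t *tailer) poll() []byte {
	b := t.seg.readFrom(t.off)
	t.off += len(b)
	return b
}

// run polls on every tick until done is closed, passing new bytes to handle.
func (t *tailer) run(tick <-chan time.Time, done <-chan struct{}, handle func([]byte)) {
	for {
		select {
		case <-done:
			return
		case <-tick:
			if b := t.poll(); len(b) > 0 {
				handle(b)
			}
		}
	}
}

func main() {
	seg := &segment{}
	t := &tailer{seg: seg}
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()

	done := make(chan struct{})
	stopped := make(chan struct{})
	go func() {
		defer close(stopped)
		t.run(ticker.C, done, func(b []byte) { fmt.Printf("read %q\n", b) })
	}()

	seg.write([]byte("sample-1;"))
	time.Sleep(50 * time.Millisecond)
	seg.write([]byte("sample-2;"))
	time.Sleep(50 * time.Millisecond)
	close(done)
	<-stopped
}
```

Running one such tailer per configured endpoint would mirror the "tailed once for each remote_write endpoint" behaviour, at the cost of reading the same data several times.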

Enqueuing a sample for sending via remote_write can now block, to provide back pressure.  Queues are still required to achieve parallelism and batching.  We have updated the queue config based on new defaults for queue capacity and pending samples values - much smaller values are now possible.  The remote_write resharding code has been updated to prevent deadlocks, and extra tests have been added for these cases.
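
The blocking enqueue can be pictured with a bounded Go channel. This is only a sketch under assumptions: `sample`, `queue`, `enqueue`, `tryEnqueue` and `drainBatch` are hypothetical names, not the identifiers used in the remote_write code. A full queue makes the producer wait, which is the back pressure, while the consumer drains in batches.

```go
package main

import (
	"fmt"
	"time"
)

// sample is a hypothetical stand-in for a scraped sample.
type sample struct {
	ts    int64
	value float64
}

// queue is a bounded buffer; its capacity is the back-pressure limit.
type queue struct {
	ch chan sample
}

func newQueue(capacity int) *queue {
	return &queue{ch: make(chan sample, capacity)}
}

// enqueue blocks until there is room in the queue.
func (q *queue) enqueue(s sample) { q.ch <- s }

// tryEnqueue is the non-blocking variant: it reports false when the queue
// is full, which is exactly the situation in which enqueue would block.
func (q *queue) tryEnqueue(s sample) bool {
	select {
	case q.ch <- s:
		return true
	default:
		return false
	}
}

// drainBatch removes up to max samples without blocking, as a sender might
// do before building a single request.
func (q *queue) drainBatch(max int) []sample {
	batch := make([]sample, 0, max)
	for len(batch) < max {
		select {
		case s := <-q.ch:
			batch = append(batch, s)
		default:
			return batch
		}
	}
	return batch
}

func main() {
	q := newQueue(2)
	go func() {
		// A slow consumer: the producer below stalls whenever the queue is full.
		for {
			time.Sleep(20 * time.Millisecond)
			q.drainBatch(2)
		}
	}()

	start := time.Now()
	for i := 0; i < 6; i++ {
		q.enqueue(sample{ts: int64(i), value: float64(i)})
	}
	fmt.Printf("enqueued 6 samples in %v\n", time.Since(start).Round(time.Millisecond))
}
```

Because the producer waits instead of buffering without limit, the queue capacity can stay small, which fits the note about much smaller default values.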

As part of this change, we attempt to guarantee that samples are not lost; however this initial version doesn't guarantee this across Prometheus restarts or non-retryable errors from the remote end (e.g. 400s).

This change also includes the following optimisations:
- only marshal the proto request once, not once per retry
- maintain a single copy of the labels for given series to reduce GC pressure
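
The first optimisation in the list above can be sketched as follows. The `encode` and `send` callbacks, `errTemporary` and `sendWithRetry` are hypothetical and stand in for proto marshalling and the HTTP call; the sketch only shows the request being encoded once and the same bytes being reused for every attempt.

```go
package main

import (
	"errors"
	"fmt"
)

// errTemporary is a hypothetical retryable error used by the example.
var errTemporary = errors.New("temporary send failure")

// sendWithRetry encodes the request exactly once and reuses the resulting
// bytes for every attempt, instead of re-encoding on each retry.
func sendWithRetry(encode func() ([]byte, error), send func([]byte) error, maxAttempts int) error {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	payload, err := encode()
	if err != nil {
		return err
	}
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = send(payload); err == nil {
			return nil
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	encodes, sends := 0, 0
	err := sendWithRetry(
		func() ([]byte, error) {
			encodes++
			return []byte("encoded write request"), nil
		},
		func(p []byte) error {
			sends++
			if sends < 3 {
				return errTemporary // fail the first two attempts
			}
			return nil
		},
		5,
	)
	fmt.Println("error:", err, "encodes:", encodes, "sends:", sends)
}
```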

Other minor tweaks:
- only reshard if we've also successfully sent recently
- add pending samples, latest sent timestamp, WAL events processed metrics

Co-authored-by: Chris Marchbanks <csmarchbanks.com> (initial prototype)
Co-authored-by: Tom Wilkie <tom.wilkie@gmail.com> (sharding changes)
Signed-off-by: Callum Styan <callumstyan@gmail.com>
2019-02-12 11:39:13 +00:00
Maria Nemtinova 8e3a39f725 Web UI QoL improvements (#5201)
1. Added the ability to resize the text area on mouse click
2. Remember selected target status button on page reload

Signed-off-by: Maria Nemtinova <nemtinovamasha@gmail.com>
2019-02-12 00:22:05 +01:00
JoeWrightss 4cb6c202ff Fix fmt.Errorf error message (#5199)
Signed-off-by: zhoulin xie <zhoulin.xie@daocloud.io>
2019-02-10 15:16:20 +05:30
Tariq Ibrahim a2a6e24f9f show list of offending labels in the error message in many-to-many scenarios (#5189)
Signed-off-by: tariqibrahim <tariq181290@gmail.com>
2019-02-09 10:17:52 +01:00
Minh-Long Do b26b5c9e96 Add rendering test of template based web endpoints (#5188)
Signed-off-by: Minh-Long  Do <minhlong.langos@gmail.com>
2019-02-08 10:17:47 +00:00
Simon Pasquier fc10f6d814
Unset GO111MODULE variable in Makefile.common (#5191)
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2019-02-07 17:22:04 +01:00
Goutham Veeramachaneni 9b8bbe3246
Merge pull request #5187 from prometheus/beorn7/release
Merge v2.7 bugfixes into master
2019-02-06 21:32:06 +01:00
beorn7 d26e134bd4 Merge branch 'release-2.7' into beorn7/release
Signed-off-by: beorn7 <beorn@soundcloud.com>
2019-02-06 15:22:40 +01:00
Björn Rabenstein 3db36f34ec
Merge pull request #5186 from prometheus/beorn7/metrics
Fix prometheus_rule_group_last_evaluation_timestamp_seconds
2019-02-06 15:19:08 +01:00
beorn7 2db1eeb4ec Fix prometheus_rule_group_last_evaluation_timestamp_seconds
It should be a unix timestamp, not the seconds in the minute.

Signed-off-by: beorn7 <beorn@soundcloud.com>
2019-02-06 11:02:49 +01:00
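
The difference named in commit 2db1eeb4ec can be shown with Go's `time` package. This is an illustration, not the patched code: `secondOfMinute`, `unixSeconds` and `exampleEval` are hypothetical names. `Time.Second()` returns only the 0-59 seconds field, while a Unix timestamp counts seconds since the epoch.

```go
package main

import (
	"fmt"
	"time"
)

// exampleEval is a hypothetical evaluation time used for the demonstration.
var exampleEval = time.Date(2019, time.February, 6, 10, 2, 49, 0, time.UTC)

// secondOfMinute returns only the seconds field of t (0-59). Exporting this
// as a "timestamp" yields a value that wraps around every minute.
func secondOfMinute(t time.Time) float64 {
	return float64(t.Second())
}

// unixSeconds returns seconds since the Unix epoch, including any fraction,
// which is what a *_timestamp_seconds metric is expected to carry.
func unixSeconds(t time.Time) float64 {
	return float64(t.UnixNano()) / 1e9
}

func main() {
	fmt.Println("seconds in the minute:", secondOfMinute(exampleEval))
	fmt.Printf("unix timestamp:        %.0f\n", unixSeconds(exampleEval))
}
```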
zhulongcheng fd964426a7 web: predeclare and reuse errors (#5180)
Predeclare and reuse errors to reduce duplicate code

Signed-off-by: zhulongcheng <zhulongcheng.me@gmail.com>
2019-02-04 13:06:26 +01:00
zhulongcheng a75f8a8e05 update error message in extractTimeRange (#5179)
Update error message in the extractTimeRange function
to match function's logic

Signed-off-by: zhulongcheng <zhulongcheng.me@gmail.com>
2019-02-03 09:29:23 +00:00
JoeWrightss e158c53fa9 Fix some typos in comment (#5175)
Signed-off-by: zhoulin xie <zhoulin.xie@daocloud.io>
2019-02-01 14:35:32 +00:00
Brian Brazil c66aeb3fff
In histogram_quantile merge buckets with equivalent le values (#5158)
This makes things generally more resilient, and will
help with OpenMetrics transitions (and inconsistencies).

Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
2019-02-01 10:22:44 +00:00
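
A sketch of what merging buckets with equivalent le values (commit c66aeb3fff) could look like, assuming that label values such as "1" and "1.0" parse to the same upper bound and that the counts of such buckets are summed. `bucket`, `parseBuckets` and `coalesce` are hypothetical names for illustration, not the promql identifiers.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
)

// bucket is one cumulative histogram bucket.
type bucket struct {
	upperBound float64
	count      float64
}

// parseBuckets converts le label values into float upper bounds. Different
// spellings of the same number ("1" and "1.0") end up with the same bound.
func parseBuckets(le []string, counts []float64) ([]bucket, error) {
	if len(le) != len(counts) {
		return nil, fmt.Errorf("got %d le values but %d counts", len(le), len(counts))
	}
	out := make([]bucket, 0, len(le))
	for i, s := range le {
		ub, err := strconv.ParseFloat(s, 64)
		if err != nil {
			return nil, fmt.Errorf("bad le value %q: %w", s, err)
		}
		out = append(out, bucket{upperBound: ub, count: counts[i]})
	}
	return out, nil
}

// coalesce sorts buckets by upper bound and merges buckets that share a
// bound by summing their counts.
func coalesce(bs []bucket) []bucket {
	if len(bs) == 0 {
		return bs
	}
	sort.Slice(bs, func(i, j int) bool { return bs[i].upperBound < bs[j].upperBound })
	merged := []bucket{bs[0]}
	for _, b := range bs[1:] {
		last := &merged[len(merged)-1]
		if b.upperBound == last.upperBound {
			last.count += b.count
		} else {
			merged = append(merged, b)
		}
	}
	return merged
}

func main() {
	bs, err := parseBuckets(
		[]string{"0.5", "1", "1.0", "+Inf"},
		[]float64{3, 4, 2, 10},
	)
	if err != nil {
		panic(err)
	}
	for _, b := range coalesce(bs) {
		fmt.Printf("le=%v count=%v\n", b.upperBound, b.count)
	}
}
```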
Simon Pasquier a60431f3cd Merge v2.7.1 into master (#5170)
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2019-02-01 09:54:12 +01:00
Vishnunarayan K I 108b9b0e5f Limit number of metrics in prometheus UI (#5139)
Signed-off-by: Vishnunarayan K I <appukuttancr@gmail.com>
2019-01-31 17:03:50 +00:00
Frederic Branczyk 50e1228f88
Merge pull request #5147 from prometheus/brancz-patch-1
docs: Add filesystem POSIX requirement
2019-01-31 16:20:46 +01:00
Frederic Branczyk 32079f351f
docs: Specifically call out NFS and POSIX
Signed-off-by: Frederic Branczyk <fbranczyk@gmail.com>
2019-01-31 12:57:48 +01:00
Goutham Veeramachaneni 62e591f928
*: cut 2.7.1 (#5164)
Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
2019-01-31 12:13:25 +01:00
Goutham Veeramachaneni b03d6f6eff
Remove custom highlight code, it's not needed. (#5163)
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
2019-01-31 11:27:18 +01:00
Ganesh Vernekar 10ae00ab9d Fix bug from #4898 (#5161)
Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
2019-01-31 11:14:14 +01:00
Ganesh Vernekar 787eb1e904 Set rule_group_last_duration_seconds to seconds (#5153)
Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
2019-01-31 11:07:58 +01:00
Frederic Branczyk 3de734d8de
docs: Add filesystem POSIX requirement
Signed-off-by: Frederic Branczyk <fbranczyk@gmail.com>
2019-01-29 13:51:16 +01:00
Ganesh Vernekar a2ef8cf2f5
Merge pull request #5146 from tariq1890/ineff
fix ineffectual assignment in dns.go
2019-01-29 11:50:34 +05:30
tariqibrahim b173de0c26 fix ineffectual assignment in dns.go
Signed-off-by: tariqibrahim <tariq181290@gmail.com>
2019-01-28 17:15:43 -08:00
Jannick Fahlbusch 63f375e80a [FIX] Azure DS: Return error when request failed (#4719)
This fixes the issue that the error is swallowed when the request failed.

Signed-off-by: Jannick Fahlbusch <git@jf-projects.de>
2019-01-28 21:31:45 +00:00
Brian Brazil 1dd57765b4
Reduce time that alertmanagers are in flux when reloaded. (#5126)
This no longer waits for all of the scrape reload to complete
before getting a list of AMs again.

Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
2019-01-28 18:34:12 +00:00
Bryan Boreham 8841692a63 Use the context associated with the inner evaluation span (#5130)
Signed-off-by: Bryan Boreham <bryan@weave.works>
2019-01-28 18:33:30 +00:00
Tariq Ibrahim f4275d2352 Use the latest versions of azure go sdk and go-autorest (#5015)
Signed-off-by: tariqibrahim <tariq181290@gmail.com>
2019-01-28 18:30:29 +00:00
Tariq Ibrahim bfcdba211f remove the prepended watch reactor from the fake k8s client (#5140)
Signed-off-by: tariqibrahim <tariq.ibrahim@microsoft.com>
2019-01-28 16:42:25 +01:00
Goutham Veeramachaneni 410ee9e04a
*: cut 2.7.0 (#5141)
Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
2019-01-28 15:37:30 +05:30
Goutham Veeramachaneni 7f7b211047
*: cut 2.7.0-rc.2 (#5134)
Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
2019-01-24 18:55:04 +05:30
Goutham Veeramachaneni b454ed3ec2
*: cut 2.7.0-rc.1 (#5123)
Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
2019-01-21 18:47:37 +05:30
Goutham Veeramachaneni 4e83f91cfd
Rollback Dockerfile to version @ 2.5.x (#5122)
Fixes https://github.com/prometheus/prometheus/issues/5043

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
2019-01-21 17:27:16 +05:30
Hrishikesh Barman 9c4e258651 corrected regex string check for anyorigin(*) (#5117)
Signed-off-by: Hrishikesh Barman <hrishikeshbman@gmail.com>
2019-01-21 17:17:27 +05:30
Goutham Veeramachaneni 24f19f03db
*: cut 2.7.0-rc.0 (#5114)
Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
2019-01-18 22:16:02 +05:30
Goutham Veeramachaneni 4068968e12
Protect retention from overflowing (#5112)
Also sanitise the max block duration to max a month.

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
2019-01-18 20:18:06 +05:30
Goutham Veeramachaneni 384cba1211
Add flag for size based retention (#5109)
* Add flag for size based retention

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>

* Deprecate the old retention flag for a new one.

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>

* Add ability to take a suffix for size flag

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>

* Address feedback

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
2019-01-18 19:18:36 +05:30
Krasi Georgiev 3bd41cc92c Update tsdb to 0.4 (#5110)
* update tsdb to v0.4.0

Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>

* remove unused struct field

Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>
2019-01-18 16:32:14 +05:30
Simon Pasquier 68e4c211f2
discovery/azure: more robust handling of go routines (#5106)
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2019-01-18 09:55:47 +01:00
Hrishikesh Barman a1f34bec2e Added CORS Origin flag (#5011)
Signed-off-by: Hrishikesh Barman <hrishikeshbman@gmail.com>
2019-01-17 15:01:06 +00:00
Matt Layher c44cd7e166
Merge pull request #5102 from prometheus/mdl-gofmt
*: apply gofmt -s
2019-01-16 19:12:43 -05:00
Matt Layher 67c43f3054
Merge pull request #5101 from prometheus/mdl-no-fatal
pkg/runtime: use panic instead of log.Fatal for system call errors
2019-01-16 19:12:29 -05:00
Matt Layher a68d1f2688
Merge pull request #5100 from prometheus/mdl-lexer-subtests
promql: use subtests in TestLexer
2019-01-16 19:12:11 -05:00
Matt Layher 302148fd69 *: apply gofmt -s
Signed-off-by: Matt Layher <mdlayher@gmail.com>
2019-01-16 17:28:14 -05:00
Matt Layher 28eff63eca pkg/runtime: use panic instead of log.Fatal for system call errors
Signed-off-by: Matt Layher <mdlayher@gmail.com>
2019-01-16 17:03:30 -05:00
Matt Layher f62fd2bfc9 promql: use subtests in TestLexer
Signed-off-by: Matt Layher <mdlayher@gmail.com>
2019-01-16 16:39:32 -05:00
Ryan Leung 45c8b084c6 fix TestFailedStartupExitCode (#5076)
Signed-off-by: rleungx <rleungx@gmail.com>
2019-01-16 10:13:36 +01:00
Simon Pasquier 22a1def98d
Merge pull request #5099 from prometheus/release-2.6
Merge release-2.6 to master
2019-01-16 09:26:00 +01:00
Callum Styan 5358f76c5c update remote write path proto so that Labels/Timeseries can't be nil (#4957)
Signed-off-by: Callum Styan <callumstyan@gmail.com>
2019-01-15 19:13:39 +00:00