437 commits

Author SHA1 Message Date
Callum Styan 6f69e31398 Tail the TSDB WAL for remote_write
This change switches the remote_write API to use the TSDB WAL.  This should reduce memory usage and prevent sample loss when the remote endpoint is down.

We use the new LiveReader from TSDB to tail WAL segments.  Logic for finding the tracking segment is included in this PR.  The WAL is tailed once for each remote_write endpoint specified. Reading from the segment is based on a ticker rather than relying on fsnotify write events, which were found to be complicated and unreliable in early prototypes.

Enqueuing a sample for sending via remote_write can now block, to provide back pressure.  Queues are still required to achieve parallelism and batching.  We have updated the queue config based on new defaults for queue capacity and pending samples values - much smaller values are now possible.  The remote_write resharding code has been updated to prevent deadlocks, and extra tests have been added for these cases.

As part of this change, we attempt to guarantee that samples are not lost; however this initial version doesn't guarantee this across Prometheus restarts or non-retryable errors from the remote end (e.g. 400s).

This change also includes the following optimisations:
- only marshal the proto request once, not once per retry
- maintain a single copy of the labels for given series to reduce GC pressure

Other minor tweaks:
- only reshard if we've also successfully sent recently
- add pending samples, latest sent timestamp, WAL events processed metrics

Co-authored-by: Chris Marchbanks <csmarchbanks.com> (initial prototype)
Co-authored-by: Tom Wilkie <tom.wilkie@gmail.com> (sharding changes)
Signed-off-by: Callum Styan <callumstyan@gmail.com>
2019-02-12 11:39:13 +00:00
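As a rough illustration of the ticker-based tailing described in 6f69e31398, the sketch below polls a growing file on a fixed interval instead of waiting for file system events. It is not the TSDB LiveReader: the `tailer` type, its `poll` and `run` methods, and the temporary file standing in for a WAL segment are all hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"time"
)

// tailer remembers how far into a file it has read. The type and its
// methods are hypothetical and are not part of the TSDB API.
type tailer struct {
	path   string
	offset int64
}

// poll returns the bytes appended to the file since the previous call.
func (t *tailer) poll() ([]byte, error) {
	f, err := os.Open(t.path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	if _, err := f.Seek(t.offset, io.SeekStart); err != nil {
		return nil, err
	}
	data, err := io.ReadAll(f)
	if err != nil {
		return nil, err
	}
	t.offset += int64(len(data))
	return data, nil
}

// run polls once per tick, for a fixed number of ticks, and hands any new
// bytes to handle. No file system notifications are involved.
func (t *tailer) run(interval time.Duration, ticks int, handle func([]byte)) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for i := 0; i < ticks; i++ {
		<-ticker.C
		data, err := t.poll()
		if err != nil {
			return err
		}
		if len(data) > 0 {
			handle(data)
		}
	}
	return nil
}

// demo appends to a temporary file between polls and returns what was read.
func demo() (string, error) {
	f, err := os.CreateTemp("", "segment-*")
	if err != nil {
		return "", err
	}
	defer os.Remove(f.Name())
	defer f.Close()

	var got string
	t := &tailer{path: f.Name()}
	for _, record := range []string{"sample-1\n", "sample-2\n"} {
		if _, err := f.WriteString(record); err != nil {
			return "", err
		}
		err := t.run(10*time.Millisecond, 1, func(b []byte) { got += string(b) })
		if err != nil {
			return "", err
		}
	}
	return got, nil
}

func main() {
	got, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(got)
}
```

Polling trades a little latency (at most one interval) for not depending on write notifications, which the commit message reports were unreliable in early prototypes.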
Brian Brazil 1dd57765b4
Reduce time that alertmanagers are in flux when reloaded. (#5126)
This no longer waits for all of the scrape reload to complete
before getting a list of AMs again.

Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
2019-01-28 18:34:12 +00:00
Goutham Veeramachaneni 4068968e12
Protect retention from overflowing (#5112)
Also sanitise the max block duration to be at most a month.

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
2019-01-18 20:18:06 +05:30
Goutham Veeramachaneni 384cba1211
Add flag for size based retention (#5109)
* Add flag for size based retention

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>

* Deprecate the old retention flag for a new one.

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>

* Add ability to take a suffix for size flag

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>

* Address feedback

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
2019-01-18 19:18:36 +05:30
Hrishikesh Barman a1f34bec2e Added CORS Origin flag (#5011)
Signed-off-by: Hrishikesh Barman <hrishikeshbman@gmail.com>
2019-01-17 15:01:06 +00:00
Matt Layher 302148fd69 *: apply gofmt -s
Signed-off-by: Matt Layher <mdlayher@gmail.com>
2019-01-16 17:28:14 -05:00
Ryan Leung 45c8b084c6 fix TestFailedStartupExitCode (#5076)
Signed-off-by: rleungx <rleungx@gmail.com>
2019-01-16 10:13:36 +01:00
Lv Jiawei b8ede99767 Fix comment typo (#5087)
According to code, I think it is a typo.

Signed-off-by: MIBc <lvjiawei@cmss.chinamobile.com>
2019-01-09 10:56:47 +00:00
Frederic Branczyk e9ae0b5a1b
Merge pull request #4927 from tariq1890/update_k8s
update client-go to v10.0.0 and other k8s deps to v1.13.1
2019-01-07 10:54:34 +01:00
Simon Pasquier f678e27eb6
*: use latest release of staticcheck (#5057)
* *: use latest release of staticcheck

It also fixes a couple of things in the code flagged by the additional
checks.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>

* Use official release of staticcheck

Also run 'go list' before staticcheck to avoid failures when downloading packages.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2019-01-04 14:47:38 +01:00
tariqibrahim 9b4a25e7b0 use klog dependency
Signed-off-by: tariqibrahim <tariq181290@gmail.com>
2019-01-03 13:57:20 -08:00
glutamatt 5ddde1965b tune the "Wal segment size" with a flag (#5029)
Add WALSegmentSize as an option, and the corresponding flag "storage.tsdb.wal-segment-size" to tune the max size of wal segment files.

The underlying problem is the disk space used by WAL segment files: on a Raspberry Pi, for instance, we often want to reduce the write load on the SD card, so the WAL directory is mounted on a memory-backed (space-limited) partition.

The default maximum segment file size grows the directory by 128 MB for each segment, which is too much RAM consumption on a Raspberry Pi.

The initial discussion is at https://github.com/prometheus/tsdb/pull/450
2019-01-03 17:13:21 +03:00
Ganesh Vernekar 7d30ccd0eb Sort samples before comparing - PromQL unit test (#5052)
Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
2018-12-31 10:55:49 +00:00
Ganesh Vernekar dbe55c1352 Subquery (#4831)
Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
2018-12-22 13:47:13 +00:00
Simon Pasquier a2766a94a3 cmd/prometheus: add tests for sendAlerts() (#4910)
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-12-18 11:15:46 +00:00
AixesHunter 1b166d7174 Fix variable 'notifier' collides with imported package name 'github.com/prometheus/prometheus/notifier', changed to 'notifierManager'. (#4947)
Signed-off-by: aixeshunter <aixeshunter@gmail.com>
2018-12-18 11:13:18 +00:00
Ganesh Vernekar fbadd88ba5 Get unique eval times for alert unit tests (#4964)
Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
2018-12-18 08:40:03 +00:00
Simon Pasquier ac9d5f3d53
cmd/prometheus: replace glog by glog-gokit (#4931)
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-12-04 15:01:12 +01:00
Krasi Georgiev 080e6ed31a
collect cpu and trace profiles with the promtool debug command (#4897)
Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>
2018-11-23 17:57:31 +02:00
Alex Yu 5dcce32ef8 update promlog to latest version (#4876)
* update promlog to latest version

Signed-off-by: Alex Yu <yu.alex96@gmail.com>

* Update api tests, fix main setup

Signed-off-by: Alex Yu <yu.alex96@gmail.com>

* tidy go.sum

Signed-off-by: Alex Yu <yu.alex96@gmail.com>

* revendor prometheus/common

Signed-off-by: Alex Yu <yu.alex96@gmail.com>

* only initialize config; use kingpin for remote_storage_adapter

Signed-off-by: Alex Yu <yu.alex96@gmail.com>

* actually parse the flags

Signed-off-by: Alex Yu <yu.alex96@gmail.com>

* clean up imports

Signed-off-by: Alex Yu <yu.alex96@gmail.com>
2018-11-23 14:22:40 +01:00
Ganesh Vernekar cfb3769274 Lazily load samples for unit testing (#4851)
* Lazily load samples for unit testing

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

* cleanup

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
2018-11-22 14:21:38 +05:30
achiuBAE a9050c45f6 Allow setting the Prometheus instance document title through a flag. (#4841)
* web: added ability to set page title through flag.

Signed-off-by: Andrew Chiu <andrew.chiu2@baesystems.com>

* Reformatted variable names and Flag description for readability.

Signed-off-by: Andrew Chiu <andrew.chiu2@baesystems.com>

* assets_vfsdata.go

Signed-off-by: Andrew Chiu <andrew.chiu2@baesystems.com>

* Flag name changed from web.ui-title to web.page-title

Signed-off-by: Andrew Chiu <andrew.chiu2@baesystems.com>

* make assets

Signed-off-by: Andrew Chiu <andrew.chiu2@baesystems.com>
2018-11-21 12:45:06 +08:00
stuart nelson 6a69471bc2
[promtool] Support writing output as json (#4848)
* Support writing output as json

Oftentimes I'll want to execute something based on
the output from promtool, and supporting json
makes it easy to pull out values with a supporting
tool such as jq.

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
2018-11-14 18:40:07 +01:00
Lucas Serven 70c8b2c63c
cmd/prometheus: buffer signal chans
According to the GoDoc for signal.Notify [0]:

> Package signal will not block sending to c: the caller must ensure that
> c has sufficient buffer space to keep up with the expected signal rate.
> For a channel used for notification of just one signal value, a buffer
> of size 1 is sufficient.

[0] https://golang.org/pkg/os/signal/#Notify

Signed-off-by: Lucas Serven <lserven@gmail.com>
2018-11-14 10:33:28 +01:00
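The point made in 70c8b2c63c can be seen with a small sketch. `trySend` below imitates the non-blocking send that the quoted documentation describes; it is a stand-in for illustration, not the os/signal implementation, and the helper names `trySend`, `delivered` and `newTermChan` are hypothetical. Only `signal.Notify` and `signal.Stop` are real library calls.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

// trySend mimics a non-blocking send: if the channel cannot accept the
// value right now, the value is dropped.
func trySend(c chan<- os.Signal, s os.Signal) bool {
	select {
	case c <- s:
		return true
	default:
		return false
	}
}

// delivered reports whether a signal sent while no goroutine is receiving
// survives in a channel with the given buffer size.
func delivered(buffer int) bool {
	c := make(chan os.Signal, buffer)
	return trySend(c, os.Interrupt)
}

// newTermChan registers a buffered channel for termination signals.
func newTermChan() chan os.Signal {
	term := make(chan os.Signal, 1) // buffer of 1: enough for one notification
	signal.Notify(term, os.Interrupt, syscall.SIGTERM)
	return term
}

func main() {
	term := newTermChan()
	defer signal.Stop(term)

	fmt.Println("unbuffered delivered:", delivered(0))
	fmt.Println("buffered delivered:", delivered(1))
}
```

With an unbuffered channel the signal is lost whenever the receiver is not already waiting, which is why a buffer of size 1 is the usual choice for a shutdown channel.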
Frederic Branczyk bda9781ccd
Merge pull request #3839 from brancz/remove-old-alert-record
promql: Remove old and unused alerting/recording syntax
2018-11-06 15:53:27 +01:00
Simon Pasquier a30348f1a4 discovery: add config label to discovered targets metric (#4753)
* discovery: add labels to discovered targets metric

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-10-18 16:46:59 +01:00
Callum Styan 9bca041285 WIP: keep track of samples per query, set a max # of samples (#4513)
* keep track of samples per query, set a max # of samples that can be in
memory at once

Signed-off-by: Callum Styan <callumstyan@gmail.com>
2018-10-02 12:59:19 +01:00
Tom Wilkie 4c52400708
Limit concurrent remote reads. (#4656)
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2018-09-25 20:07:34 +01:00
Ganesh Vernekar 5790d23fd8 Unit testing for rules (#4350)
* Unit testing for rules
* Specifying order of group evaluation in unit tests

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
2018-09-25 17:06:26 +01:00
Tom Wilkie 457e4bb58e
Limit the number of samples remote read can return. (#4532)
* Limit the number of samples remote read can return.

- Return 413 entity too large.
- Limit can be set by a flag.  Allow 0 to mean no limit.
- Include limit in error message.
- Set default limit to 50M (* 16 bytes = 800MB).

Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2018-09-05 15:50:50 +02:00
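The rules listed in 457e4bb58e can be sketched as a small check. This is not the Prometheus handler: `exceeds`, `checkLimit`, the constant names and the error text are hypothetical, and reading the "16 bytes" figure as an 8-byte timestamp plus an 8-byte value is an assumption.

```go
package main

import (
	"fmt"
	"net/http"
)

// bytesPerSample follows the "* 16 bytes" figure in the commit message. The
// split into an 8-byte timestamp and an 8-byte value is an assumption.
const bytesPerSample = 16

// defaultLimit is the 50M default named in the commit message.
const defaultLimit = 50000000

// exceeds reports whether numSamples is over limit; a limit of 0 disables
// the check.
func exceeds(numSamples, limit int64) bool {
	return limit > 0 && numSamples > limit
}

// checkLimit returns the status code and message a handler could send.
// The message includes the limit, as the commit message asks for.
func checkLimit(numSamples, limit int64) (int, string) {
	if exceeds(numSamples, limit) {
		return http.StatusRequestEntityTooLarge,
			fmt.Sprintf("exceeded sample limit (%d)", limit)
	}
	return http.StatusOK, ""
}

func main() {
	fmt.Println(defaultLimit*bytesPerSample, "bytes at the default limit")
	fmt.Println(checkLimit(defaultLimit+1, defaultLimit))
	fmt.Println(checkLimit(defaultLimit+1, 0))
}
```

The arithmetic behind the default is 50,000,000 samples × 16 bytes = 800,000,000 bytes, i.e. the 800MB quoted in the commit message.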
Chris Marchbanks 63ed9d1b70 Send EndsAt along with alerts (#4550)
Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2018-08-28 16:05:00 +01:00
Chris Marchbanks 87f1dad16d throttle resends of alerts to 1 minute by default (#4538)
Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2018-08-27 17:41:42 +01:00
Krasi Georgiev 12fe204ea6
move runtime debug funcs in own package (#4494)
To make local debugging with `go run` easier, moved all files into a
dedicated package `runtime`.
This allows running prometheus just by using `go run main.go` instead of
passing many files like `go run main.go limits_default.go ...`

Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>
2018-08-22 13:41:11 +03:00
Simon Pasquier 08c2f50382
Merge pull request #4418 from simonpasquier/log-vm-limits
prometheus: log virtual memory limits
2018-08-07 16:27:46 +02:00
Frederic Branczyk b0b3e3dd74
promql: Remove old and unused alerting/recording syntax
Signed-off-by: Frederic Branczyk <fbranczyk@gmail.com>
2018-08-07 15:14:06 +02:00
Dave Henderson 73a08f0045 promtool - Adding --step flag to 'query range' subcommand (#4454)
Signed-off-by: Dave Henderson <dhenderson@gmail.com>
2018-08-05 11:03:18 +02:00
Julius Volz 90521a65f8
Remove error return value from NotifyFunc() (#4459)
It's always nil and we also forgot to check it.

Signed-off-by: Julius Volz <julius.volz@gmail.com>
2018-08-04 21:31:12 +02:00
Ganesh Vernekar f1db699dff Persist alert 'for' state across restarts (#4061)
Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
2018-08-02 11:18:24 +01:00
Simon Pasquier a94450c288 Fix build for openbsd
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-07-31 14:41:30 +02:00
Simon Pasquier 141c188ae6 Enforce conversion for freebsd
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-07-26 14:58:56 +02:00
Simon Pasquier 208d21a393 Add comment and print units
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-07-26 10:26:58 +02:00
Simon Pasquier ba22b10113 prometheus: log virtual memory limits
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-07-25 15:51:27 +02:00
Daisy T a3376e8f36 add query labels command to promtool (#4346)
Signed-off-by: Daisy T <daisyts@gmx.com>
2018-07-18 16:27:28 +02:00
Julius Volz 95dfb1b1dd
Add missing import to promtool, fix build (#4395)
Sorry, I used GitHub's web-based merge-conflict-resolution editor on
https://github.com/prometheus/prometheus/pull/4308 and it didn't show me
test errors afterwards, but maybe they didn't run again or I should have
waited or something.

Signed-off-by: Julius Volz <julius.volz@gmail.com>
2018-07-18 10:26:45 +02:00
Shubheksha 125da3b812 promtool: add command for querying series (#4308)
Signed-off-by: Shubheksha Jalan <jshubheksha@gmail.com>
2018-07-18 10:15:58 +02:00
Julius Volz 03aa3a3de8
main: Improve / clean up error messages (#4286)
Signed-off-by: Julius Volz <julius.volz@gmail.com>
2018-07-18 09:58:40 +02:00
Chih-Hung Yeh 912d19fb85 Add 3 commands in promtool for getting debug information from prometheus server (#4247)
`debug all` - all information
`debug metrics` - metrics  information
`debug pprof` - profiling  information

the final result is compressed in a `tar.gz` file

Signed-off-by: chyeh <chyeh.taiwan@gmail.com>
2018-07-18 10:52:01 +03:00
Brian Brazil 68e8b80ffe
Reorder startup and shutdown to prevent panics. (#4321)
Start rule manager only after tsdb and config is loaded.
Stop rule manager before tsdb to avoid writing to closed storage.
Wait for any in-progress reloads to complete before shutting
down rule manager, so that rule manager doesn't get updated after
being shut down.

Remove incorrect comment around shutting down query engine.
Log when config reload is completed.

Fixes #4133
Fixes #4262

Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
2018-07-04 13:41:16 +01:00
Michael Khalil 78e0784d04 return error exit status in prometheus cli (#4296)
Signed-off-by: mikeykhalil <mikeyfkhalil@gmail.com>
2018-06-21 08:32:26 +01:00
Tom Wilkie 8acad5f3cd make it compile
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2018-05-24 15:40:24 +01:00
Tom Wilkie e51d6c4b6c Make remote flush deadline a command line param.
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2018-05-23 15:06:01 +01:00
Sneha Inguva c1a851074b promtool: add query instant and query range commands (#4085)
* promtool: add QueryInstant and QueryRange cmds

* promtool: add more query functions

* promtool: finished query Instant

* promtool: add range query

* promtool: add query command and address arguments

* vendor client and api
2018-04-26 20:41:56 +02:00
Mario Trangoni 464e747f1e fix some comments typos (#4059) 2018-04-08 10:51:54 +01:00
Sneha Inguva 7be846754a main: actor functionality comments 2018-04-01 11:19:30 -07:00
Marek Siarkowicz bb86c3f62b Report internal runtime information on status page (#3921)
Add information about tsdb, wal and config reload
2018-03-21 16:08:37 +00:00
James Turnbull ba5273a0ab Minor edits to help text (#3990) 2018-03-20 16:54:36 +00:00
Simon Pasquier e1fd96db25 cmd: fix help text (#3989) 2018-03-20 15:58:19 +00:00
ferhat elmas ffa673f7d8 General simplifications (#3887)
Another try as in #1516
2018-02-26 07:58:10 +00:00
Bartek Plotka 93a63ac5fd api: Added v1/status/flags endpoint. (#3864)
Endpoint URL: /api/v1/status/flags
Example Output:
```json
{
  "status": "success",
  "data": {
    "alertmanager.notification-queue-capacity": "10000",
    "alertmanager.timeout": "10s",
    "completion-bash": "false",
    "completion-script-bash": "false",
    "completion-script-zsh": "false",
    "config.file": "my_cool_prometheus.yaml",
    "help": "false",
    "help-long": "false",
    "help-man": "false",
    "log.level": "info",
    "query.lookback-delta": "5m",
    "query.max-concurrency": "20",
    "query.timeout": "2m",
    "storage.tsdb.max-block-duration": "36h",
    "storage.tsdb.min-block-duration": "2h",
    "storage.tsdb.no-lockfile": "false",
    "storage.tsdb.path": "data/",
    "storage.tsdb.retention": "15d",
    "version": "false",
    "web.console.libraries": "console_libraries",
    "web.console.templates": "consoles",
    "web.enable-admin-api": "false",
    "web.enable-lifecycle": "false",
    "web.external-url": "",
    "web.listen-address": "0.0.0.0:9090",
    "web.max-connections": "512",
    "web.read-timeout": "5m",
    "web.route-prefix": "/",
    "web.user-assets": ""
  }
}
```

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>
2018-02-21 08:49:02 +00:00
Fabian Reinartz 7ccd4b39b8 *: implement query params
This adds a parameter to the storage selection interface which allows
query engine(s) to pass information about the operations surrounding a
data selection.
This can for example be used by remote storage backends to infer the
correct downsampling aggregates that need to be provided.
2018-02-13 12:17:22 +01:00
Conor Broderick 5169ccf258
Merge pull request #3724 from simonpasquier/fix-bad-data-error
Don't reset FiredAt for inactive alerts
2018-02-01 16:18:09 +00:00
Krasi Georgiev b75428ec19 rename package retrieve to scrape
no functional changes, just renaming retrieval to scrape
2018-02-01 09:55:07 +00:00
Krasi Georgiev 7858745c04 rename structs for consistency 2018-01-30 17:49:05 +00:00
Krasi Georgiev acc4197098 remove discovery race for the context field 2018-01-29 15:18:07 +00:00
Julien Pivotto 8b20cb1e8d last config success time gauge: use SetToCurrentTime() (#3750)
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2018-01-27 07:48:13 +00:00
Simon Pasquier 81c0ab69e0 Don't reset FiredAt for inactive alerts
Otherwise AlertManager receives resolved alerts where StartsAt is zero which
fails the validation.
2018-01-22 17:17:33 +01:00
Krasi Georgiev 719c579f7b refactor main execution reloadReady handling, update some comments 2018-01-17 18:14:24 +00:00
Krasi Georgiev 0eafaf32d3 set the correct config reloading execution for scraper and notifier 2018-01-17 13:06:56 +00:00
Krasi Georgiev 97f0461e29 refactor the config reloading execution 2018-01-17 12:02:13 +00:00
Krasi Georgiev 5260c650ec use the config hash for the map lookup 2018-01-16 11:10:54 +00:00
Krasi Georgiev 8369826808 comment to rethink the map reference for the notifier discovery 2018-01-16 09:47:53 +00:00
Krasi Georgiev d12e6f29fc discovery manager ApplyConfig now takes a direct ServiceDiscoveryConfig so that it can be used for the notify manager
reimplement the service discovery for the notify manager

Signed-off-by: Krasi Georgiev <krasi.root@gmail.com>
2018-01-15 13:39:44 +00:00
Shubheksha Jalan 0471e64ad1 Use shared types from the common repo (#3674)
* refactor: use shared types from common repo, remove util/config

* vendor: add common/config

* fix nit
2018-01-11 16:10:25 +01:00
Goutham Veeramachaneni 35a6ffbaf3
Merge pull request #3587 from krasi-georgiev/web-test-error-check
handle web_test webhandler errors.
2018-01-10 22:03:25 +05:30
Shubheksha Jalan ec94df49d4 Refactor SD configuration to remove config dependency (#3629)
* refactor: move targetGroup struct and CheckOverflow() to their own package

* refactor: move auth and security related structs to a utility package, fix import error in utility package

* refactor: Azure SD, remove SD struct from config

* refactor: DNS SD, remove SD struct from config into dns package

* refactor: ec2 SD, move SD struct from config into the ec2 package

* refactor: file SD, move SD struct from config to file discovery package

* refactor: gce, move SD struct from config to gce discovery package

* refactor: move HTTPClientConfig and URL into util/config, fix import error in httputil

* refactor: consul, move SD struct from config into consul discovery package

* refactor: marathon, move SD struct from config into marathon discovery package

* refactor: triton, move SD struct from config to triton discovery package, fix test

* refactor: zookeeper, move SD structs from config to zookeeper discovery package

* refactor: openstack, remove SD struct from config, move into openstack discovery package

* refactor: kubernetes, move SD struct from config into kubernetes discovery package

* refactor: notifier, use targetgroup package instead of config

* refactor: tests for file, marathon, triton SD - use targetgroup package instead of config.TargetGroup

* refactor: retrieval, use targetgroup package instead of config.TargetGroup

* refactor: storage, use config util package

* refactor: discovery manager, use targetgroup package instead of config.TargetGroup

* refactor: use HTTPClient and TLS config from configUtil instead of config

* refactor: tests, use targetgroup package instead of config.TargetGroup

* refactor: fix targetgroup.Group pointers that were removed by mistake

* refactor: openstack, kubernetes: drop prefixes

* refactor: remove import aliases forced due to vscode bug

* refactor: move main SD struct out of config into discovery/config

* refactor: rename configUtil to config_util

* refactor: rename yamlUtil to yaml_config

* refactor: kubernetes, remove prefixes

* refactor: move the TargetGroup package to discovery/

* refactor: fix order of imports
2017-12-29 21:01:34 +01:00
Brian Brazil ecc24b554d
Hide block duration flags. (#3618)
Users are starting to use these mistakenly thinking they'll help
with issues, and thus causing some confusion.
Thus hide them and make it clear that they're only there for testing
reasons.
2017-12-24 12:13:48 +00:00
Krasi Georgiev c94fa731aa bypass the proxy for the tests 2017-12-20 18:21:10 +00:00
Krasi Georgiev ad66476c4f fix flaky main.go test and simplify a bit 2017-12-19 15:07:49 +00:00
Fabian Reinartz 2881d73ed8
Merge pull request #3362 from krasi-georgiev/discovery-refactoring
Decouple the discovery and refactor the retrieval package
2017-12-19 12:56:34 +01:00
Goutham Veeramachaneni 9c9f96b2c0
Merge pull request #3529 from krasi-georgiev/main-integration-test
main.go integration test for Startup interrupting.
2017-12-18 22:12:13 -06:00
Krasi Georgiev 587dec9eb9 rebased and resolved conflicts with the new Discovery GUI page
Signed-off-by: Krasi Georgiev <krasi.root@gmail.com>
2017-12-18 20:10:03 +00:00
Krasi Georgiev 1ec76d1950 rearrange the contexts variables and logic
split the groupsMerge function to set and get
other small nits
2017-12-18 17:23:47 +00:00
Krasi Georgiev 6ff1d5c51e add the scrape manager config reloader
handle errors with invalid scrape config
2017-12-18 17:23:47 +00:00
Krasi Georgiev b0d4f6ee08 resolved merge conflict in main.go 2017-12-18 17:23:46 +00:00
Krasi Georgiev c5cb0d2910 simplify naming and API. 2017-12-18 17:22:50 +00:00
Krasi Georgiev 9c61f0e8a0 scrape pool doesn't rely on context as Stop() needs to be blocking to prevent Scrape loops trying to write to a closed TSDB storage. 2017-12-18 17:22:49 +00:00
Krasi Georgiev e405e2f1ea refactored discovery 2017-12-18 17:22:49 +00:00
pasquier-s 2440696961 Log file descriptor limits at startup (#3567)
Fixes #3564
2017-12-11 13:01:53 +00:00
Alberto Cortés 29da2fb9cd testutil: update to go1.9 testing.Helper 2017-12-08 19:06:53 +01:00
Alberto Cortés 8f6a9f7833 config: simplify tests by using testutil.NotOk (#3289)
Also include filename in all LoadFile errors

Also add message to testutil.NotOk so we can identify failing tests when
using table driven tests.
2017-12-08 16:52:25 +00:00
Krasi Georgiev 740662644e write to temp dir and remove it at the end.
Signed-off-by: Krasi Georgiev <krasi.root@gmail.com>
2017-12-06 10:45:58 +00:00
Brian Brazil b97f4cf48c Add metrics for rule group interval and last duration. 2017-12-04 11:44:38 +00:00
Krasi Georgiev 2c2a962da3 main.go integration test for Startup interrupting. 2017-12-01 10:58:01 +00:00
Goutham Veeramachaneni 823b7f90b3
Use the globbed files and not the files in cfg
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-11-30 17:08:34 +05:30
Fabian Reinartz 62461379b7 rules: decouple notifier packages
The dependency on the notifier packages caused a transitive dependency
on discovery and with that all client libraries our service discovery
uses.
2017-11-27 16:38:14 +01:00
Fabian Reinartz 4d964a0a0d rules: make glob expansion a concern of main 2017-11-24 08:22:57 +01:00
Fabian Reinartz bd9f7460eb rules: remove config package dependency 2017-11-24 07:57:54 +01:00
Fabian Reinartz 2d0e3746ac rules: remove dependency on promql.Engine 2017-11-24 07:57:54 +01:00
Krasi Georgiev e2f4850fea Refactor main.go with oklog/pkg/group actors pattern 2017-11-11 12:33:15 +00:00
Thibault Chataigner fc4406201e Tsdb StartTime : Use a simpler way to compute StartTime 2017-10-25 17:41:00 +02:00
Julius Volz 099df0c5f0 Migrate "golang.org/x/net/context" -> "context" (#3333)
In some places, where ctxhttp or gRPC are concerned, we still need to use the
old contexts.
2017-10-24 21:21:42 -07:00
Julius Volz 9d43176ab3 Remove unused printVersion variable (#3335)
Kingpin now automatically does this via --version.
2017-10-23 08:50:13 +01:00
Julius Volz 82c5b98496 Capitalize Prometheus in startup message (#3332)
Hey, branding :)
2017-10-23 08:49:28 +01:00
Thibault Chataigner bf4a279a91 Remote storage reads based on oldest timestamp in primary storage (#3129)
Currently all read queries are simply pushed to remote read clients.
This is fine, except for remote storage for which it is inefficient and
makes queries slower even if remote read is unnecessary.
So we need instead to compare the oldest timestamp in primary/local
storage with the query range lower boundary. If the oldest timestamp
is older than the mint parameter, then there is no need for remote read.
This is an optional behavior per remote read client.

Signed-off-by: Thibault Chataigner <t.chataigner@criteo.com>
2017-10-18 12:08:14 +01:00
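The comparison described in bf4a279a91 can be written as a one-line predicate. The sketch below is only an illustration under assumptions: the function name `needsRemoteRead`, the `remoteClient` type and its `skipWhenLocalCovers` field are hypothetical, and timestamps are taken to be milliseconds since the epoch.

```go
package main

import "fmt"

// needsRemoteRead reports whether a query starting at mint reaches further
// back than the oldest sample in local storage. When the oldest local
// sample is older than mint, local storage covers the whole range.
func needsRemoteRead(localOldest, mint int64) bool {
	return !(localOldest < mint)
}

// remoteClient is a hypothetical stand-in for a remote read client. The
// skipWhenLocalCovers field models the per-client opt-in from the commit
// message; the real option name is not given there.
type remoteClient struct {
	name                string
	skipWhenLocalCovers bool
}

// clientsToQuery returns the names of the clients that should receive the
// read request for a query whose lower boundary is mint.
func clientsToQuery(clients []remoteClient, localOldest, mint int64) []string {
	var names []string
	for _, c := range clients {
		if c.skipWhenLocalCovers && !needsRemoteRead(localOldest, mint) {
			continue // local storage already has the requested range
		}
		names = append(names, c.name)
	}
	return names
}

func main() {
	clients := []remoteClient{
		{name: "long-term", skipWhenLocalCovers: true},
		{name: "always", skipWhenLocalCovers: false},
	}
	// Local data starts at t=1000; the first query starts later, the second earlier.
	fmt.Println(clientsToQuery(clients, 1000, 5000))
	fmt.Println(clientsToQuery(clients, 1000, 500))
}
```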
Julius Volz 5f715f5733 Fix typo in flag description (#3302) 2017-10-16 23:00:05 +01:00
Tobias Schmidt 3589f2f1d4 Merge pull request #3285 from jlevesy/use-testutils-in-cmd-subpackage
Use testutil assertion helpers in cmd package
2017-10-13 00:12:39 +02:00
Julien Levesy d7b4fa8d78 use testutil assertions in the cmd/prometheus package 2017-10-12 13:45:38 +02:00
Mathieu Pasquet 38afa507bb Provide better errors messages in commandline
Instead of only printing the help message, which is not always helpful.
For example, when upgrading from prometheus v1, the retention time value
format has changed and now only accepts one unit (e.g. "15d") where it
previously allowed more complex strings (e.g. "360h0m0s").

This commit provides the error message as an explanation for the parsing
failure.
2017-10-09 16:25:50 +02:00
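To make the situation in 38afa507bb concrete, here is a minimal sketch of a parser that accepts only one unit and of a caller that prints the parse error instead of a bare usage message. `parseSingleUnit`, the regular expression, the unit table and the error wording are all hypothetical; this is not the parser Prometheus uses.

```go
package main

import (
	"fmt"
	"os"
	"regexp"
	"strconv"
	"time"
)

// singleUnit matches a number followed by exactly one unit, e.g. "15d".
var singleUnit = regexp.MustCompile(`^([0-9]+)(ms|s|m|h|d|w|y)$`)

// unitLength maps each accepted unit to its length (hypothetical table).
var unitLength = map[string]time.Duration{
	"ms": time.Millisecond,
	"s":  time.Second,
	"m":  time.Minute,
	"h":  time.Hour,
	"d":  24 * time.Hour,
	"w":  7 * 24 * time.Hour,
	"y":  365 * 24 * time.Hour,
}

// parseSingleUnit parses strings such as "15d" and rejects compound
// strings such as "360h0m0s" with an explanatory error.
func parseSingleUnit(s string) (time.Duration, error) {
	m := singleUnit.FindStringSubmatch(s)
	if m == nil {
		return 0, fmt.Errorf("not a valid duration string: %q", s)
	}
	n, err := strconv.ParseInt(m[1], 10, 64)
	if err != nil {
		return 0, err
	}
	return time.Duration(n) * unitLength[m[2]], nil
}

func main() {
	for _, arg := range []string{"15d", "360h0m0s"} {
		d, err := parseSingleUnit(arg)
		if err != nil {
			// Surface the reason for the failure, not just the help text.
			fmt.Fprintln(os.Stderr, "error parsing retention:", err)
			continue
		}
		fmt.Println(arg, "=", d)
	}
}
```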
Marc Sluiter 6a633eece1 Added go-conntrack for monitoring http connections (#3241)
Added metrics for in- and outgoing traffic with go-conntrack.
2017-10-06 11:22:19 +01:00
Fabian Reinartz 2d0b8e8b94 Merge branch 'master' into dev-2.0 2017-10-05 13:09:18 +02:00
Paul Gier 08af129b4d cmd/prometheus: don't allow quotes at beginning or end of url
This prevents accidental copy/paste errors where the web.external-url
or alertmanager.url params could have an extra set of quotes.
See also: https://github.com/prometheus/prometheus/issues/1229
2017-10-04 10:10:02 -05:00
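The check described in 08af129b4d could look roughly like the sketch below. The helper names and the error wording are illustrative assumptions rather than the code from the commit; only the flag names come from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// startsOrEndsWithQuote reports whether s begins or ends with a single or
// double quote character.
func startsOrEndsWithQuote(s string) bool {
	return strings.HasPrefix(s, `"`) || strings.HasPrefix(s, "'") ||
		strings.HasSuffix(s, `"`) || strings.HasSuffix(s, "'")
}

// checkURLFlag rejects a flag value that still carries quotes, which can
// happen when a quoted value is pasted into a place where no shell strips
// the quotes.
func checkURLFlag(name, value string) error {
	if startsOrEndsWithQuote(value) {
		return fmt.Errorf("--%s must not begin or end with quotes: %s", name, value)
	}
	return nil
}

func main() {
	fmt.Println(checkURLFlag("web.external-url", `"http://localhost:9090"`))
	fmt.Println(checkURLFlag("alertmanager.url", "http://localhost:9093"))
}
```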
Paul Gier f79b55d057 cmd/prometheus: remove govalidator for url validation
The usage of govalidator is redundant with the call to url.Parse for
url validation. Removing it has the following benefits:

 - The explicit error message is displayed instead of just a generic
   valid/invalid message
 - Slightly smaller code with one fewer external dependency
 - Speed improvement by removing duplicate call to url.Parse (inside
   govalidator.IsURL())
 - Resolves issue #2717

The only potential drawback of removing govalidator is that certain
URLs will be considered valid which were previously invalid. For example:

 - URLs with hostnames that start and/or end with an underscore (http://_example.com_)
 - URLs with hostnames that contain some special characters (http://foo&*bar.org)

These are valid URIs according to RFC 3986 and valid domain names per RFC 2181,
however they are not valid hostnames per RFC 952.
2017-10-04 10:08:34 -05:00
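The first benefit listed in f79b55d057, an explicit error instead of a generic valid/invalid answer, can be illustrated with `url.Parse` alone. The wrapper `validateURL` and the sample inputs are hypothetical; the sketch does not reproduce the flag handling in cmd/prometheus.

```go
package main

import (
	"fmt"
	"net/url"
)

// validateURL relies on url.Parse only and passes its error on, so the
// caller sees why a value was rejected.
func validateURL(s string) error {
	if _, err := url.Parse(s); err != nil {
		return fmt.Errorf("invalid URL %q: %w", s, err)
	}
	return nil
}

func main() {
	inputs := []string{
		"http://localhost:9090/", // well formed
		"http://[::1",            // unterminated IPv6 literal
		"://missing-scheme",      // no scheme before the colon
	}
	for _, s := range inputs {
		if err := validateURL(s); err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Println("ok:", s)
	}
}
```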
Fabian Reinartz 7b02bfee0a web: start web handler while TSDB is starting up 2017-09-20 15:03:19 +02:00
Goutham Veeramachaneni f5aed810f9 logging: Port to common/promlog
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-09-15 12:40:50 +05:30
Fabian Reinartz d21f149745 *: migrate to go-kit/log 2017-09-08 22:01:51 +05:30
Fabian Reinartz c70379e1c7 Merge branch 'dev-2.0' of github.com:prometheus/prometheus into dev-2.0 2017-09-04 13:10:50 +02:00
Fabian Reinartz fffe51fb03 Add mutex and block profiling via envvar 2017-09-04 13:10:32 +02:00
Ben Kochie 59aca4138b Fix staticcheck issues. 2017-08-28 17:29:01 +02:00
Matt Bostock 64973f5c65 cmd/prometheus: Fix capitalisation in log line (#3123)
Change 'Ready' to 'ready'.
2017-08-28 11:03:25 +01:00
Mark Adams 77c816b309 Fix pprof endpoints when -web.route-prefix or -web.external-url is used (#3054)
Whenever a route prefix is applied, the router prepends the prefix to
the URL path on the request. For most handlers, this is not an issue
because the request's path is only used for routing and is not actually
needed by the handler itself. However, Prometheus delegates the handling
of the /debug/* endpoints to the http.DefaultServeMux which has its own
routing logic that depends on the url.Path. As a result, whenever a
prefix is applied, the prefixed URL is passed to the DefaultServeMux
which has no awareness of the prefix and returns a 404.

This change fixes the issue by creating a new serveDebug handler which
routes /debug/* requests to the appropriate net/http/pprof handler
and by removing the net/http/pprof import in cmd/prometheus since it is no
longer necessary.

Fixes #2183.
2017-08-23 00:00:56 +01:00
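The problem and fix described in 77c816b309 can be reproduced in a few lines. The sketch below is an approximation, not the handler from the commit: it takes the prefix as a plain string argument, the `/prometheus` prefix is an example value, and the helpers `status`, `defaultMuxStatus` and `serveDebugStatus` exist only for the demonstration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/pprof"
	"strings"
)

const debugRoot = "/debug/pprof/"

// serveDebug returns a handler that strips the route prefix itself and
// then dispatches directly to the net/http/pprof handlers.
func serveDebug(prefix string) http.HandlerFunc {
	return func(w http.ResponseWriter, req *http.Request) {
		path := strings.TrimPrefix(req.URL.Path, prefix)
		if !strings.HasPrefix(path, debugRoot) {
			http.NotFound(w, req)
			return
		}
		switch strings.TrimPrefix(path, debugRoot) {
		case "cmdline":
			pprof.Cmdline(w, req)
		case "profile":
			pprof.Profile(w, req)
		case "symbol":
			pprof.Symbol(w, req)
		case "trace":
			pprof.Trace(w, req)
		default:
			// pprof.Index inspects the path, so hand it the unprefixed one.
			req.URL.Path = path
			pprof.Index(w, req)
		}
	}
}

// status runs one GET request through h and returns the response code.
func status(h http.Handler, target string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, target, nil))
	return rec.Code
}

// defaultMuxStatus sends target to http.DefaultServeMux, which knows only
// the unprefixed /debug/pprof/ routes.
func defaultMuxStatus(target string) int {
	return status(http.DefaultServeMux, target)
}

// serveDebugStatus sends target to the prefix-aware handler.
func serveDebugStatus(prefix, target string) int {
	return status(serveDebug(prefix), target)
}

func main() {
	target := "/prometheus/debug/pprof/cmdline"
	fmt.Println("DefaultServeMux:", defaultMuxStatus(target))
	fmt.Println("serveDebug:", serveDebugStatus("/prometheus", target))
}
```

Dispatching to the pprof handlers directly means the routing decision no longer depends on what path the default mux sees.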
Callum Styan 8912f81ffe check if file_sd files exist in checkConfig 2017-08-22 15:25:30 -07:00
Fabian Reinartz 25f3e1c424 Merge branch 'master' into mergemaster 2017-08-10 17:04:25 +02:00
KalivarapuReshma 686050d816 Change -config.file to --config.file in Readme and error message 2017-08-08 12:49:35 +05:30
emluque ff54c5c11a 2831 Add Healthy and Ready endpoints 2017-08-07 17:34:04 -03:00
Fabian Reinartz 4d3d8ee229 Merge pull request #2850 from tomwilkie/dev-2.0-remote
Remote APIs for v2
2017-08-03 13:39:09 +02:00
Julius Volz cc50aa2c6b main: Consistently end flag descriptions with periods. (#2977) 2017-07-20 23:48:35 +02:00
Tom Wilkie 2dda5775e3 Initial port of remote storage to v2. 2017-07-12 12:27:57 +01:00
Fabian Reinartz 32226e30f5 Guard reload and quit endpoints by flag 2017-07-11 14:25:07 +02:00
Fabian Reinartz 45ac064669 web: disable Admin APIs by default 2017-07-10 09:29:41 +02:00
Fabian Reinartz ccf9e62972 *: add admin grpc API 2017-07-10 09:14:14 +02:00
Fabian Reinartz be32afd6df cmd/prometheus: add back tsdb.no-lockfile flag 2017-06-22 15:02:10 +02:00
Goutham Veeramachaneni f9202c6511
Move from .yaml to .yml in update rules
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-06-21 18:38:37 +05:30
Goutham Veeramachaneni e3701077c3
Move promtool to kingpin
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-06-21 17:42:57 +05:30
Fabian Reinartz 867b8d108f cmd/prometheus: cleanup 2017-06-21 11:38:13 +02:00
Fabian Reinartz 34ab7a885a cmd/prometheus: switch to kingpin 2017-06-20 17:38:01 +02:00
Goutham Veeramachaneni 592cb00c2f
Remove version from RuleGroups
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-06-19 16:38:46 +05:30
Goutham Veeramachaneni 37e7b69f56
Merge remote-tracking branch 'upstream/dev-2.0' into rulegroups
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-06-19 16:34:55 +05:30
Goutham Veeramachaneni 67dc73fd59
Flag changes for 2.0
Fixes: prometheus/prometheus#2087

Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-06-16 20:21:41 +05:30
Goutham Veeramachaneni d407bd150c Consolidate the duration params in CLI
* All CLI params moved to model.Duration

Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-06-16 20:20:57 +05:30
Goutham Veeramachaneni 6b70a4d850
Incorporate PR feedback
* Move fingerprint to Hash()
* Move away from tsdb.MultiError
* 0777 -> 0666 for files
* checkOverflow of extra fields

Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-06-16 16:44:33 +05:30
Goutham Veeramachaneni 6c1617fd13
Simplify usage string
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-06-16 15:55:13 +05:30
Goutham Veeramachaneni 507790a357
Rework logging to use explicitly passed logger
Mostly cleaned up the global logger use. Still some uses in discovery
package.

Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-06-16 15:52:44 +05:30
Goutham Veeramachaneni dc69645e92
Move back to go-yaml
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-06-16 10:46:21 +05:30
Goutham Veeramachaneni 8abb91f656
Move CLI commander to cobra
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-06-15 16:38:08 +05:30
Goutham Veeramachaneni 1c08743721
Update check-rules to new format.
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-06-14 13:32:26 +05:30
Goutham Veeramachaneni cea1e99f78
Add update-rules command to promtool
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-06-14 11:38:54 +05:30
Fabian Reinartz 669075c6b9 Merge branch 'master' into dev-2.0 2017-06-06 09:36:51 +02:00
Chris Goller 42de0ae013 Use log.Logger interface for all discovery services 2017-06-01 11:25:55 -05:00
Conor Broderick 6766123f93 Replace regex with Secret type and remarshal config to hide secrets (#2775) 2017-05-29 12:46:23 +01:00
Fabian Reinartz 4c31061251 Merge branch 'master' into dev-2.0 2017-05-24 15:36:17 +02:00
Fabian Reinartz d289dc55c3 storage: update TSDB 2017-05-22 11:53:08 +02:00
Shashank Varanasi dea60bb553 Fix malformed uname string (#2727)
* Fix malformed uname string

* Make fix better

* Reformat code for simplicity
2017-05-16 18:44:11 +02:00
Fabian Reinartz 06c2b76cd4 Merge branch 'master' into uptsdb 2017-05-16 16:48:37 +02:00
Shashank Varanasi 61235fd851 Print system information (uname) at Prometheus startup (#2709)
* Print uname on prom startup

* Make uname file linux-only

* Add missing license headers

Add missing license headers

* Print OS when uname is not available

* Print only OS name when uname not available

* Remove extra space, fix cmd/prometheus/main.go license header

* Add fix for int8 and uint8 systems

* Better formatting for build tags in cmd/prometheus/uname files

* Remove newline
2017-05-13 20:42:29 +02:00
Frederic Branczyk c50a3eccce
prometheus: default max-block-duration to 10% of retention 2017-05-12 11:48:51 +02:00
Michal Witkowski 4177c35eba Fixup sighup for P2 TSDB init #2699 2017-05-09 17:00:54 +01:00
Fabian Reinartz 9b175d48cb Add flag to disable TSDB lock file 2017-05-09 12:56:51 +02:00
Fabian Reinartz 73b8ff0ddc Merge branch 'master' into dev-2.0 2017-04-27 10:19:55 +02:00
Matt Layher 283756c503 Initial commit of 'promtool check-metrics', promlint package (#2605) 2017-04-13 23:53:41 +02:00
Fabian Reinartz 757cba7c31 cmd/prometheus: Undo GOGC adjustment 2017-04-10 16:22:01 +02:00
beorn7 f20b84e816 flags: Improve doc strings for checkpoint flags 2017-04-07 13:10:12 +02:00
Fabian Reinartz 8ffc851147 Merge branch 'master' into dev-2.0 2017-04-04 15:17:56 +02:00
Julius Volz 589061919a Merge pull request #2465 from Gouthamve/alert-metrics-2429
Better Metrics For Alerts
2017-03-31 21:45:05 +02:00
Goutham Veeramachaneni f27ce34a13
Use Registerer to Register All Metrics
* Made Metric a Gauge so that it can be registered.
2017-04-01 00:14:30 +05:30
Goutham Veeramachaneni 0d0c9d5440
Move Registerer to Config Struct in Notifier 2017-03-31 21:20:12 +05:30
Björn Rabenstein 29f05680a2 Merge pull request #2528 from prometheus/beorn7/storage2
main.go: Set GOGC to 40 by default
2017-03-27 15:00:37 +02:00
Björn Rabenstein e63d079b59 Merge pull request #2527 from prometheus/beorn7/storage
storage: Evict chunks and calculate persistence pressure...
2017-03-27 14:49:42 +02:00
Julius Volz b5b0e00923 Merge pull request #2499 from prometheus/remote-read
Remote Read
2017-03-27 14:43:44 +02:00
beorn7 434ab2a6a3 storage: Evict chunks and calculate persistence pressure based on target heap size
This is a fairly easy attempt to dynamically evict chunks based on the
heap size. A target heap size has to be set as a command line flag,
so that users can essentially say "utilize 4GiB of RAM, and please
don't OOM".

The -storage.local.max-chunks-to-persist and
-storage.local.memory-chunks flags are deprecated by this
change. Backwards compatibility is provided by ignoring
-storage.local.max-chunks-to-persist and using
-storage.local.memory-chunks to set the new
-storage.local.target-heap-size to a reasonable (and conservative)
value (both with a warning).

This also makes the metrics instrumentation more consistent (in
naming and implementation) and cleans up a few quirks in the tests.

Answers to anticipated comments:

There is a chance that Go 1.9 will allow programs better control over
the Go memory management. I don't expect those changes to be in
contradiction with the approach here, but I do expect them to
complement them and allow them to be more precise and controlled. In
any case, once those Go changes are available, this code has to be
revisited.

One might be tempted to let the user specify an estimated value for
the RSS usage, and then internally set a target heap size of a certain
fraction of that. (In my experience, 2/3 is a fairly safe bet.)
However, investigations have shown that RSS size and its relation to
the heap size is really really complicated. It depends on so many
factors that I wouldn't even start listing them in a commit
description. It depends on many circumstances and not least on the
risk trade-off of each individual user between RAM utilization and
probability of OOMing during a RAM usage peak. To not add even more to
the confusion, we need to stick to the well-defined number we also use
in the targeting here, the sum of the sizes of heap objects.
2017-03-27 14:33:50 +02:00
beorn7 96a303b348 storage: Use staleness delta as head chunk timeout
Currently, if a series ceases to exist, its head chunk will be kept
open for an hour. That prevents it from being persisted. Which
prevents it from being evicted. Which prevents the series from being
archived.

Most of the time, once no sample has been added to a series within the
staleness limit, we can be pretty confident that this series will not
receive samples anymore. The whole chain as described above can be
started after 5m instead of 1h. In the relaxed case, this doesn't
change a lot as the head chunk timeout is only checked during series
maintenance, and usually, a series is only maintained every six
hours. However, there is the typical scenario where a large service is
deployed, the deploy turns out to be bad, and then it is deployed
again within minutes, and quite quickly the number of time series has
tripled. That's the point where the Prometheus server is stressed and
switches (rightfully) into rushed mode. In that mode, time series are
processed as quickly as possible, but all of that is in vain if all of
those recently ended time series cannot be persisted yet for another
hour. In that scenario, this change will help most, and it's exactly
the scenario where help is most desperately needed.
2017-03-26 23:44:50 +02:00
beorn7 04ccf84559 main.go: Set GOGC to 40 by default
Rationale: The default value for GOGC is 100, i.e. a garbage collection
is initiated once as much heap space has been allocated as was in
use after the last GC was done. This ratio doesn't make a lot of sense
in Prometheus, as typically about 60% of the heap is allocated for
long-lived memory chunks (most of which are around for many hours if
not days). Thus, short-lived heap objects are accumulated for quite
some time until they finally match the large amount of memory used by
bulk memory chunks and a gigantic GC cycle is invoked. With GOGC=40, we
are essentially reinstating "normal" GC behavior by acknowledging that
about 60% of the heap is used for long-term bulk storage.

The median Prometheus production server at SoundCloud runs a GC cycle
every 90 seconds. With GOGC=40, a GC cycle is run every 35 seconds
(which is still not very often). However, the effective RAM usage is
now reduced by about 30%. If settings are updated to utilize more RAM,
the time between GC cycles goes up again (as the heap size is larger
with more long-lived memory chunks, but the frequency of creating
short-lived heap objects does not change). On a quite busy large
Prometheus server, the timing changed from one GC run every 20s to one
GC run every 12s.

In the former case (just changing GOGC, leave everything else as it
is), the CPU usage increases by about 10% (on a mid-size reference
server from 8.1 to 8.9). If settings are adjusted, the CPU
consumption increases more drastically (from 8 cores to 13 cores on a
large reference server), despite GCs happening more rarely, presumably
because a 50% larger set of memory chunks is managed now. Having more
memory chunks is good in many regards, and most servers are running
out of memory long before they run out of CPU cycles, so the tradeoff
is overwhelmingly positive in most cases.

Power users can still set the GOGC environment variable as usual, as
the implementation in this commit honors an explicitly set variable.
2017-03-26 21:55:37 +02:00
Julius Volz 8fda83ea12 Make rules only read local data 2017-03-21 00:50:04 +01:00
Julius Volz 406b65d0dc Rename remote.Storage to remote.Writer 2017-03-20 13:15:28 +01:00
Julius Volz 02395a224d [WIP] Remote Read 2017-03-20 13:13:44 +01:00
Fabian Reinartz b586781283 *: update tsdb vendoring and add retention flag 2017-03-17 16:06:04 +01:00
Goutham Veeramachaneni f35816613e
Refactored Notifier to use Registerer
* Brought metrics back into Notifier

Notifier still implements a Collector. Check if that is needed.
2017-03-03 02:53:16 +05:30
Fabian Reinartz 9304179ef7 Merge branch 'master' into dev-2.0 2017-03-02 08:16:58 +01:00
Fabian Reinartz 4397b4d508 *: pass Prometheus registry into storage 2017-02-28 09:33:14 +01:00
Julius Volz beb3c4b389 Remove legacy remote storage implementations
This removes legacy support for specific remote storage systems in favor
of only offering the generic remote write protocol. An example bridge
application that translates from the generic protocol to each of those
legacy backends is still provided at:

documentation/examples/remote_storage/remote_storage_bridge

See also https://github.com/prometheus/prometheus/issues/10

The next step in the plan is to re-add support for multiple remote
storages.
2017-02-14 17:52:05 +01:00
Fabian Reinartz ea3ba338dd main: add flags for new storage 2017-02-05 18:22:06 +01:00
Fabian Reinartz 5772f1a7ba retrieval/storage: adapt to new interface
This simplifies the interface to two add methods for
appends with labels or faster reference numbers.
2017-02-02 13:05:46 +01:00
Fabian Reinartz 1d3cdd0d67 Merge branch 'master' into dev-2.0-rebase 2017-01-30 17:43:01 +01:00
Fabian Reinartz 035976b275 retrieval: handle not found error correctly 2017-01-20 11:27:01 +01:00
Bartek Plotka 579e33f19a Fixed style issues. 2017-01-16 16:45:58 +00:00
Bartek Plotka d7febe97fa Fixed regression in -alertmanager.url flag. Basic auth was ignored.
- Included basic auth parsing while parsing to AlertmanagerConfig
- Added test case

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>
2017-01-16 16:39:20 +00:00
Fabian Reinartz ad9bc62e4c storage: extend appender and adapt it 2017-01-13 14:48:01 +01:00
Fabian Reinartz e631a1260d retrieval: use separate appender per target 2016-12-30 21:35:35 +01:00
Fabian Reinartz 68dc358496 cmd/prometheus: remove tests for old flags 2016-12-29 16:55:22 +01:00
Fabian Reinartz f8fc1f5bb2 *: migrate ingestion to new batch Appender 2016-12-29 11:03:56 +01:00
Fabian Reinartz 1becee3f6c main: remove Alertmanager legacy flag configuration 2016-12-25 00:43:41 +01:00
Fabian Reinartz 15a931dbdb promql: migrate model types, use tsdb interfaces 2016-12-24 00:39:52 +01:00
Fabian Reinartz 8b84ee5ee6 storage: remove old storage
This removes all old storage files and only keeps interfaces
to still allow the code to compile.
2016-12-22 23:33:32 +01:00
Fabian Reinartz 11a731ba82 remote: remove hard-coded remote storages
This commit removes the flag-configured remote storage integrations
in favor of the generic remote write path.
2016-12-22 23:17:35 +01:00
Erdem Agaoglu 054f8ebbfb Increase default max-connections 2016-12-06 17:45:19 +03:00
Erdem Agaoglu e487477a17 LimitListener to limit max number of connections
This also drops tcp keep-alive in ListenAndServe but it's no longer
necessary since we now close idle connections long before that.
2016-12-06 12:45:59 +03:00
Erdem Agaoglu 9986b28380 Set read-timeout for http.Server
This also specifies a timeout for idle client connections, which may
otherwise cause "too many open files" errors.
See #2238
2016-12-01 16:29:45 +03:00
Fabian Reinartz 3fb4d1191b config: rename AlertingConfig, resolve file paths 2016-11-24 15:19:37 +01:00
Fabian Reinartz d4deb8bbf2 web: show discovered Alertmanagers in UI 2016-11-24 15:06:50 +01:00
Fabian Reinartz f210d96497 notifier: use dynamic service discovery 2016-11-23 18:23:37 +01:00
Fabian Reinartz 200bbe1bad config: extract SD and HTTPClient configurations 2016-11-23 18:23:37 +01:00